Abstract: We present a theory of point and interval estimation for nonlinear
functionals in parametric, semi-, and non-parametric models based on higher
order influence functions (Robins (2004), Section 9; Li et al. (2004), Tchetgen
et al. (2006), Robins et al. (2007)). Higher order influence functions are higher
order U-statistics. Our theory extends the first order semiparametric theory of
Bickel et al. (1993) and van der Vaart (1991) by incorporating the theory of
higher order scores considered by Pfanzagl (1990), Small and McLeish (1994)
and Lindsay and Waterman (1996). The theory reproduces many previous
results, produces new non-$\sqrt{n}$ results, and opens up the ability to perform
optimal non-$\sqrt{n}$ inference in complex high dimensional models. We present novel
rate-optimal point and interval estimators for various functionals of central
importance to biostatistics in settings in which estimation at the expected $\sqrt{n}$
rate is not possible, owing to the curse of dimensionality. We also show that our
higher order influence functions have a multi-robustness property that extends
the double robustness property of first order influence functions described by
Robins and Rotnitzky (2001) and van der Laan and Robins (2003).

1. Introduction

Over the past 3 years, we have developed a theory of point and interval estimation
for nonlinear functionals $\psi(F)$ in parametric, semi-, and non-parametric models
based on higher order likelihood scores and influence functions that applies equally
to both $\sqrt{n}$ and non-$\sqrt{n}$ problems (Robins [1(>], Section 9, Li et al. [9], Tchetgen
et al. [21], Robins et al. [lb]). The theory reproduces results previously obtained
by the modern theory of non-parametric inference, produces many new non-$\sqrt{n}$
results, and most importantly opens up the ability to perform non-$\sqrt{n}$ inference in
complex high dimensional models, such as models for the estimation of the causal
effect of time varying treatments in the presence of time varying confounding and
informative censoring. See Tchetgen et al. [22] for examples of the latter.

Higher order influence functions are higher order U-statistics. Our theory extends
the first order semiparametric theory of Bickel et al. [3] and van der Vaart [2-^] by
incorporating the theory of higher order scores and Bhattacharyya bases considered
by Pfanzagl [11], Small and McLeish [20] and Lindsay and Waterman [8].

The purpose of this paper is to demonstrate the scope and flexibility of our
methodology by deriving rate-optimal point and interval estimators for various
functionals that are of central importance to biostatistics. We now describe some of these
functionals. We suppose we observe i.i.d. copies of a random vector $O = (Y, A, X)$
with unknown distribution $F$ on each of $n$ study subjects. In this paper, we largely
study non-parametric models that place no restrictions on $F$, other than bounds
on both the $L_p$ norms and on the smoothness of certain density and conditional
expectation functions. The variable $X$ represents a random vector of baseline
covariates such as age, height, weight, hematocrit, and laboratory measures of lung,
renal, liver, brain, and heart function. $X$ is assumed to have compact support and a
density $f_X(x)$ with respect to the Lebesgue measure in $R^d$, where, in typical
applications, $d$ is in the range 5 to 100. $A$ is a binary treatment and $Y$ is a response, higher
values of which are desirable. Then, in the absence of confounding by additional
unmeasured factors, the functional $\psi(F) = E\{E[Y|A=1,X]\} - E\{E[Y|A=0,X]\}$
is the mean effect of treatment in the total study population. Our results for
$E\{E[Y|A=1,X]\} - E\{E[Y|A=0,X]\}$ follow from results for the functional
$\psi(F) = E\{E[Y|A=1,X]\}$ based on data $(AY, A, X)$ rather than $(Y, A, X)$. If $Y$
is missing for some study subjects, and $A$ is now the indicator that takes the value
1 when $Y$ is observed and zero otherwise, then the functional $E\{E[Y|A=1,X]\}$
is the marginal mean of $Y$ under the missing at random assumption that the
probability $P[A=0|X,Y] = P[A=0|X]$ that $Y$ is missing does not depend on the
unobserved $Y$.

Returning to data $O = (Y, A, X)$, the functional

$\psi(F) = E\{\mathrm{cov}(Y,A|X)\}/E[\mathrm{var}\{A|X\}]$

$= E[w(X)\{E[Y|A=1,X] - E[Y|A=0,X]\}]$,

with $w(X) = \mathrm{var}\{A|X\}/E[\mathrm{var}\{A|X\}]$ is the variance weighted average treatment
effect. Our results for $E\{\mathrm{cov}(Y,A|X)\}/E[\mathrm{var}\{A|X\}]$ are derived from results for
the functionals $\psi(F) = E\{\mathrm{cov}(Y,A|X)\}$ and $\psi(F) = E[\{E(Y|X)\}^2]$.

We note that Robins and van der Vaart's [19] construction of an adaptive
confidence set for a regression function $E(Y|X=x)$ depended on being able to construct
a confidence interval for $\psi(F) = E[\{E(Y|X)\}^2]$. They constructed an interval
for $E[\{E(Y|X)\}^2]$ when the marginal distribution of $X$ was known. In this
paper, we construct a confidence interval for $E[\{E(Y|X)\}^2]$ when the marginal of
$X$ is unknown and, in Section 5, use it to obtain an adaptive confidence set for
$E(Y|X=x)$.

The functional $E\{\mathrm{cov}(Y,A|X)\}$ is the functional $E\{\mathrm{var}(Y|X)\}$ in the special
case in which $Y = A$ w.p.1. Minimax estimation of $\mathrm{var}(Y|X)$ has recently been
discussed by Wang et al. [27] and Cai et al. [('] in the setting of non-random $X$.

The function $\gamma(x) = E[Y|A=1,X=x] - E[Y|A=0,X=x]$ is the effect of
treatment on the subgroup with $X = x$. It is important to estimate the function
$\gamma(x)$, in addition to the average treatment effect in the total population, because
treatment should be given, since beneficial, to those subjects with $\gamma(x) > 0$ but
withheld, since harmful, from subjects with $\gamma(x) < 0$. We show that one can
obtain adaptive confidence sets for $\gamma(x)$ if one can set confidence intervals for the
functional $\psi(F) = E[\gamma(X)^2]$. We construct intervals for $E[\gamma(X)^2]$ under the
additional assumption that the data $O = (Y, A, X)$ came from a randomized trial.
In a randomized trial, in contrast to an observational study, the randomization
probabilities $P(A=1|X) = E(A|X)$ are known by design. We plan to report
confidence intervals for $E[\gamma(X)^2]$ with $E(A|X)$ unknown elsewhere.

All of the above functionals $\psi(F)$ have a positive semiparametric information
bound (SIB) and thus a (first order) efficient influence function with a finite
variance. In fact all the functionals $\psi(F)$ have efficient influence function

(1.1) $IF(b(F), p(F), \psi(F)) = if(O, b(X,F), p(X,F), \psi(F))$,

where $b(x,F)$, $p(x,F)$ are functions of certain conditional expectations, and, for
any $b^*(x)$, $p^*(x)$,

$E_F[IF(b^*, p^*, \psi(F))] = E_F[h_1(O)\{b^*(X) - b(X;F)\}\{p^*(X) - p(X;F)\}]$

where $h_1(O)$ is a known function. We refer to functionals in our class as
doubly-robust to indicate that $IF(b(F), p(F), \psi(F))$ continues to have mean zero when
either (but not both) $p(F)$ is misspecified as $p^*$ or $b(F)$ is misspecified as $b^*$.
The functions $b(x,F)$, $p(x,F)$, $IF(O, b(X,F), p(X,F), \psi(F))$, and $h_1(O)$ differ
depending on the functional $\psi(F)$ of interest.

As the functionals $\psi(F)$ are all closely related, we shall use $E\{\mathrm{cov}(Y,A|X)\}$
as a prototype in this introduction. For $\psi(F) = E\{\mathrm{cov}(Y,A|X)\}$, $b(X;F) =
E_F(Y|X)$, $p(X;F) = E_F(A|X)$,

$IF(b(F), p(F), \psi(F)) = \{Y - b(X;F)\}\{A - p(X;F)\} - \psi(F)$,

and $h_1(O) = 1$.
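The double robustness identity for this prototype can be checked numerically. The following is a minimal sketch, assuming a made-up discrete law for $(Y, A, X)$ with binary $Y$ and $A$; the names `PX`, `JOINT`, `mean_if` and `product_bias` are illustrative and do not come from the paper. It computes exact expectations by enumeration, so no sampling error is involved.

```python
# Hypothetical toy law: X in {0, 1, 2}; given X = x, (Y, A) are binary with
# joint pmf JOINT[x][(y, a)]. All numbers are made up for illustration.
PX = {0: 0.2, 1: 0.5, 2: 0.3}
JOINT = {
    0: {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3},
    1: {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.3},
    2: {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4},
}


def expect(fn):
    """Exact expectation of fn(y, a, x) under the toy law."""
    return sum(PX[x] * pr * fn(y, a, x)
               for x in PX for (y, a), pr in JOINT[x].items())


def b(x):
    """b(x; F) = E[Y | X = x]."""
    return sum(pr * y for (y, a), pr in JOINT[x].items())


def p(x):
    """p(x; F) = E[A | X = x]."""
    return sum(pr * a for (y, a), pr in JOINT[x].items())


# psi(F) = E{cov(Y, A | X)}
psi = expect(lambda y, a, x: (y - b(x)) * (a - p(x)))


def mean_if(b_star, p_star):
    """E_F[IF(b*, p*, psi(F))] with IF = (Y - b*)(A - p*) - psi(F)."""
    return expect(lambda y, a, x: (y - b_star(x)) * (a - p_star(x))) - psi


def product_bias(b_star, p_star):
    """E_F[{b*(X) - b(X)}{p*(X) - p(X)}], using h_1(O) = 1."""
    return expect(lambda y, a, x: (b_star(x) - b(x)) * (p_star(x) - p(x)))


def b_wrong(x):
    return 0.9   # deliberately misspecified b*


def p_wrong(x):
    return 0.1   # deliberately misspecified p*


print(mean_if(b_wrong, p), mean_if(b, p_wrong))      # both zero up to rounding
print(mean_if(b_wrong, p_wrong), product_bias(b_wrong, p_wrong))  # equal
```

With only one of the two nuisance functions misspecified the mean stays at zero; with both misspecified the mean equals the product-form bias term, which is second order in the two errors.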

Whenever a functional $\psi(F)$ has a non-zero SIB, given sufficiently stringent
bounds on $L_p$ norms and on smoothness, it is possible to use the estimated first order
influence function to construct regular estimators and honest asymptotic confidence
intervals whose width shrinks at the usual parametric rate of $n^{-1/2}$. We recall that,
by definition, regular estimators are $n^{1/2}$-consistent. When $X$ is high dimensional,
the a priori smoothness restrictions on $p(X;F)$ and $b(X;F)$ necessary for point
or interval estimators of $E\{\mathrm{cov}(Y,A|X)\}$ to achieve the parametric rate of $n^{-1/2}$
are so severe as to be substantively implausible. As a consequence, we replace the
usual approach based on first order influence functions by one based on higher order
influence functions.

To provide quantitative results, we require a measure of the maximal possible
complexity (e.g. smoothness) of $p(\cdot;F)$ and $b(\cdot;F)$ believed substantively plausible.
We use Hölder balls for concreteness, although our methods extend to other
measures of complexity. A function $h(\cdot)$ lies in the Hölder ball $H(\beta, C)$, with Hölder
exponent $\beta > 0$ and radius $C > 0$, if and only if $h(\cdot)$ is bounded in supremum norm
by $C$ and all partial derivatives of $h(x)$ up to order $\lfloor\beta\rfloor$ exist, and all partial
derivatives of order $\lfloor\beta\rfloor$ are Lipschitz with exponent $(\beta - \lfloor\beta\rfloor)$ and constant $C$. We make
the assumption that $b(\cdot,F)$, $p(\cdot,F)$ lie in given Hölder balls $H(\beta_b, C_b)$, $H(\beta_p, C_p)$.
Furthermore, it turns out we must also make assumptions about the complexity
of the function $g(X;F) = E_F[h_1(O)|X] f_X(X)$, which we take to lie in a given
$H(\beta_g, C_g)$. For $\psi(F) = E\{\mathrm{cov}(Y,A|X)\}$, $g(X;F) = f_X(X)$.

Using higher order influence functions, we construct regular estimators and
honest (i.e. uniform over our model) asymptotic confidence intervals for functionals
$\psi(F)$ in our class whose width shrinks at the usual parametric rate of $n^{-1/2}$
whenever $\beta/d \equiv \frac{(\beta_b+\beta_p)/2}{d} > 1/4$ and $\beta_g > 0$. This result cannot be improved on, since
even when $g(x)$ is known a priori, $\beta/d > 1/4$ is necessary for a regular estimator
to exist.

When $\beta/d < 1/4$ and $g(x)$ is known a priori, we have shown using arguments
similar to those of Birge and Massart ['i] that the minimax rate of convergence for an
estimator and minimax rate of shrinkage of a confidence interval is
$n^{-\frac{4\beta/d}{4\beta/d+1}} > n^{-1/2}$.
When $g(x)$ is unknown, we construct point and interval estimators with this same
rate of $n^{-\frac{4\beta/d}{4\beta/d+1}}$ whenever

(1.2) $\beta_g/d > \beta/d \cdot \frac{2(\Delta+1)(1-4\beta/d)}{(\Delta+2)(1+4\beta/d) - 4(\beta/d)(1-4\beta/d)(\Delta+1)}$,

where $\Delta \equiv \left|\frac{\beta_p}{\beta_b} - 1\right|$. For example, if $\Delta = 0$, $\beta/d = 1/8$, we require $\beta_g/d$ exceed
1/22 to achieve the rate $n^{-\frac{4\beta/d}{4\beta/d+1}}$. When the previous inequality does not hold and
$\Delta = 0$, we have constructed, in a yet unpublished paper, estimators that converge
at rate



(1.3) 




(1 + 2/3/d) 1. 



We conjecture that this rate is minimax, up to log factors. In this paper, however,
the estimators we construct are inefficient when the previous inequality fails to
hold, converging at rates slower than the conjectured minimax rate of Equation (1.3).
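The arithmetic behind the rates and the smoothness threshold discussed above can be reproduced exactly. This is a small sketch under one assumption: inequality (1.2) is read as $\beta_g/d > (\beta/d)\cdot 2(\Delta+1)(1-4\beta/d) / [(\Delta+2)(1+4\beta/d) - 4(\beta/d)(1-4\beta/d)(\Delta+1)]$, a reading that matches the worked value 1/22 given in the text. The function names are illustrative.

```python
from fractions import Fraction


def rate_exponent(beta_over_d):
    """Exponent r such that the rate is n**(-r).

    r = 1/2 when beta/d >= 1/4 (parametric rate), and
    r = (4 beta/d) / (4 beta/d + 1) when beta/d < 1/4.
    """
    r = Fraction(beta_over_d)
    return min(Fraction(1, 2), 4 * r / (4 * r + 1))


def beta_g_threshold(beta_over_d, delta):
    """Right-hand side of inequality (1.2) under the reading stated above."""
    r = Fraction(beta_over_d)
    delta = Fraction(delta)
    numerator = 2 * (delta + 1) * (1 - 4 * r)
    denominator = (delta + 2) * (1 + 4 * r) - 4 * r * (1 - 4 * r) * (delta + 1)
    return r * numerator / denominator


# Worked example from the text: Delta = 0 and beta/d = 1/8.
print(beta_g_threshold(Fraction(1, 8), 0))   # 1/22
print(rate_exponent(Fraction(1, 8)))         # 1/3, i.e. the rate n**(-1/3)
```

Exact rational arithmetic is used so that the threshold comes out as the fraction quoted in the text rather than a rounded decimal.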

Let us return to the case where $Y = A$ w.p.1. Then $\psi(F) = E\{\mathrm{var}(Y|X)\}$ and
$p(\cdot) = b(\cdot)$ so $\Delta = 0$. Now, for fixed $\beta$, Equation (1.3) converges to $\log(n)\, n^{-2\beta/d}$ as
$\beta_g \to 0$, which agrees (up to a log factor) with the minimax rate of $n^{-2\beta/d}$ given by
Wang et al. [27] and Cai et al. [()] under the semiparametric homoscedastic model
$\mathrm{var}(Y|X) = \sigma^2$ with equal-spaced non-random $X$. This result might suggest that
$X$ being random rather than equal-spaced can result in faster rates of convergence
only when the density of $X$ has some smoothness, as quantified here by $\beta_g > 0$.
But this suggestion is not correct. Recall that we obtained the rate $\log(n)\, n^{-2\beta/d}$
for $\psi(F) = E\{\mathrm{var}(Y|X)\}$ as $\beta_g \to 0$ under a non-parametric model. In Section
4, we construct a simple estimator of $\sigma^2$ under the homoscedastic model with $X$
random with unknown density that, for $\beta/d < 1/4$, $\beta < 1$, and without smoothness
restrictions on $f_X(x)$, converges at the rate $n^{-\frac{4\beta/d}{4\beta/d+1}}$, which is faster than the
equal-spaced non-random minimax rate of $n^{-2\beta/d}$.

The paper is organized as follows. In Section 2, we define the higher order
(estimation) influence functions of a functional $\psi(F)$ for $F$ contained in a model $\mathcal{M}$
and prove two fundamental theorems - the extended information equality theorem
and the efficient estimation influence function theorem. Further, in the context of a
parametric model whose dimension increases with sample size, we outline why
estimators based on higher order influence functions can outperform those based on first order
influence functions in high-dimensional models. In Section 3, we introduce the class
of functionals we study in the remainder of the paper and describe their
importance in biostatistics. The theory of Section 2, however, is not directly applicable
to these functionals because they have first order but not higher order influence
functions. We show that higher order influence functions fail to exist precisely
because the Dirac delta function is not an element of the Hilbert space $L_2$ of square
integrable functions. We describe two approaches to overcoming this difficulty. The
first approach is based on approximating the Dirac delta function by a projection
operator onto a subspace of $L_2$ of dimension $k(n)$, where $k(n)$ can be as large as
the square of the sample size $n$. The second approach is based on approximating the
functional $\psi(F)$ by a truncated functional $\psi_{k(n)}(F)$. The truncated functional has
influence functions of all orders, is equal to $\psi(F)$ if either a $k(n)$ dimensional
working parametric model (with $k(n) < n^2$) for the function $b(\cdot)$ or the function $p(\cdot)$
in Equation (1.1) is correct, and remains close to $\psi(F)$ even if both working
models are misspecified. We then use higher order influence function based estimators
of $\psi_{k(n)}(F)$ as estimators of $\psi(F)$. These estimators $\hat\psi_{m,k(n)}$ are asymptotically
normal with variance and bias for $\psi(F)$ depending both on the choice of the
dimension $k(n)$ of the working models and on the order $m$ of the influence function of
$\psi_{k(n)}(F)$. We show that these same estimators $\hat\psi_{m,k(n)}$ can also be obtained under
the approximate Dirac delta function approach. We derive the optimal estimator
$\hat\psi_{m_{opt},k_{opt}(n)}(\beta_b, \beta_p, \beta_g)$ in the class as a function of the Hölder balls in which the
functions $b$, $p$, and $g$ are assumed to lie. Finally we conclude Section 3 by
showing that the estimators $\hat\psi_{m,k(n)}$ have a multi-robustness property that extends the
double-robustness property of the first order influence function estimator $\hat\psi_1$.

In Section 4, we consider whether the estimators $\hat\psi_{m_{opt},k_{opt}(n)}(\beta_b, \beta_p, \beta_g)$ are
rate-minimax. We show that whenever $\beta/d \equiv \frac{(\beta_b+\beta_p)/2}{d} > 1/4$ and $\beta_g > 0$,
$\hat\psi_{m_{opt},k_{opt}(n)}(\beta_b, \beta_p, \beta_g)$ is not only rate minimax but is semiparametric efficient.
Further, by letting the order $m = m(n)$ of the U-statistic depend on sample size,
we construct a single estimator $\hat\psi_{m(n),k(n)}$ that is semiparametric efficient for all
$\beta/d > 1/4$ even when $g(\cdot)$ cannot be estimated at an algebraic rate. We show,
however, that when $\beta/d < 1/4$, $\hat\psi_{m_{opt},k_{opt}(n)}(\beta_b, \beta_p, \beta_g)$ does not in general
converge at the minimax rate. In Section 4.1, however, we construct a new estimator
$\hat\psi^{eff}_{K_J}(\beta_g, \beta_b, \beta_p)$ that converges at the minimax rate of $n^{-\frac{4\beta/d}{4\beta/d+1}}$ whenever Eq. (1.2)
holds. In Section 5, we use the results obtained in earlier sections to construct
adaptive confidence intervals for a regression function $E[Y|X=x]$ when the marginal of
$X$ is unknown and for the treatment effect function and optimal treatment regime
in a randomized clinical trial. In Section 6.1, we discuss how to obtain higher
order U-statistic point estimators and confidence intervals for functionals $\tau(F)$ that
are implicitly defined as the solution to an equation $\psi(\tau, F) = 0$. In Section 6.2,
we define higher order testing influence functions and efficient scores and describe
their relationship to the higher order estimation influence functions and efficient
influence functions of Section 2. Finally, in Section 6.3, we discuss the
relationship between the higher order U-statistic point estimators of an implicitly defined
functional $\tau(F)$ and higher order testing influence functions.

Before proceeding, several additional comments are in order. In this paper, we
investigate the asymptotic properties of our higher order U-statistic point and
interval estimators. The reader is referred to Li et al. [9] for an investigation of the
finite sample properties of our procedures through simulation. Furthermore, due to
space limitations we only provide proofs for selected theorems. Proofs of the
remaining theorems can be found in an accompanying technical report. In addition,
precise regularity conditions are sometimes omitted from both the statements and
the proofs of various theorems. This reflects the fact that the goal of this paper is
to provide a broad overview of our theory as it currently stands.

Different subject matter experts will clearly disagree as to the maximum possible
complexity of $p(x;F)$, $b(x;F)$ and $g(x;F)$. Thus it is important to have methods
that adapt to the actual smoothness of these functions. Elsewhere, we plan to
provide point estimators that optimally adapt to unknown smoothness. In contrast
to point estimators, however, for honest confidence intervals, the degree of possible
adaptation to unknown smoothness is small. Therefore we propose that an analyst
should report a mapping from a priori smoothness assumptions encoded in the
exponents and radii of Hölder balls (or in other measures of complexity) to the
associated optimal $1 - \alpha$ honest confidence intervals proposed in this paper. Such
a mapping is ultimately only useful if substantive experts can approximately quantify
their informal opinions concerning the shape and wiggliness of $p$, $b$, and $g$ using the
measure of complexity on offer by the analyst. It is an open question which, if any,
complexity measure is suitable for this purpose.

Finally, most of our mathematical results concern rates of convergence. We offer
only a few results on the constants in front of those rates. This is not because the
constant is less important than the rate in predicting how a proposed procedure will
perform in the moderate sized samples occurring in practice. Rather, at present, we
do not possess the mathematical tools necessary to obtain useful results concerning
constants. A more extended discussion of the issue is found in Section 3 of Li et al.
[9].

In the following, we use $X_n \asymp Y_n$ to mean $X_n = O_p(Y_n)$ and $Y_n = O_p(X_n)$;
$X_n \sim Y_n$ to mean $X_n/Y_n \to 1$; and $X_n \gg Y_n$ ($X_n \ll Y_n$) to respectively mean
$X_n/Y_n \to \infty$ ($0$) as $n \to \infty$.



2. Theory of higher order influence functions 

Given $n$ i.i.d. observations $\mathbf{O} \equiv \mathbf{O}_n \equiv \{O_i, i = 1,\dots,n\}$ from a model

$\mathcal{M}(\Theta) = \{F(\cdot;\theta), \theta \in \Theta\}$,

we consider inference on a nonlinear functional $\psi(\theta)$. In general, $\psi(\theta)$ can be infinite
dimensional but for now we only consider the one dimensional case. In the following
all quantities can depend on the sample size $n$, including the support of $O$, the
parameter space $\Theta$, and the functional $\psi(\theta)$. We generally suppress the dependence
on $n$ in the notation. We will be particularly interested in models in which the
parameter $\theta$ is infinite dimensional and $\theta$, $\Theta$, and $\psi(\cdot)$ do not depend on $n$. We also
briefly discuss models in which subvectors of $\theta$ are finite-dimensional parameters
whose dimension $k(n) = n^{1+\rho}$ increases as power $1 + \rho$ (often $\rho > 0$) of $n$ and thus
$\theta_n$, $\Theta_n$, and $\psi_n(\cdot)$ depend on $n$.

Our first task is to define higher order influence functions. Before proceeding
we recall some facts about U-statistics. Consider a function $b_m(o_1, o_2, \dots, o_m) \equiv
b(o_1, o_2, \dots, o_m)$ where we often suppress $b$'s subscript $m$. For integers $i_1, i_2, \dots, i_m$
lying in $\{1,\dots,n\}$, we define

$B_{m,i_1,\dots,i_m} \equiv b_m(O_{i_1}, O_{i_2}, \dots, O_{i_m}) \equiv b(O_{i_1}, O_{i_2}, \dots, O_{i_m})$

and

$\mathbb{V}_n[B_{m,i_1,\dots,i_m}] \equiv \frac{(n-m)!}{n!} \sum_{i_1 \neq i_2 \neq \dots \neq i_m} B_{m,i_1,\dots,i_m}$.

In an abuse of notation, we will consider the following expressions to be
equivalent

$\mathbb{V}_n[B_m] \equiv \mathbb{V}_n[B_{m,i_1,\dots,i_m}] \equiv \mathbb{V}_n[b_m]$.

Thus $\mathbb{V}_n[b_m]$ is an $m$th order U-statistic with kernel $b_m(o_1, o_2, \dots, o_m)$. We do not
assume that $b_m(o_1, o_2, \dots, o_m)$ is symmetric. We will write $\mathbb{V}_n[B_m]$ as $\mathbb{B}_{n,m}$. So,
suppressing the dependence on $n$, $\mathbb{B}_m = \mathbb{V}[B_m]$.
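The operator $\mathbb{V}_n$ averages a kernel over all ordered $m$-tuples of distinct indices, of which there are $n!/(n-m)!$. A minimal brute-force sketch follows; the function name `v_n` and the toy data are assumptions made for illustration, and the enumeration is only practical for small $n$ and $m$.

```python
from itertools import permutations
from math import factorial


def v_n(kernel, m, data):
    """V_n[b_m]: average of kernel over ordered m-tuples of distinct indices.

    The kernel is not assumed symmetric, so every ordering of the indices
    contributes its own term.
    """
    n = len(data)
    total = sum(kernel(*(data[i] for i in idx))
                for idx in permutations(range(n), m))
    return total * factorial(n - m) / factorial(n)


data = [1, 4, 6, 9]

# Order 1: the kernel b(o) = o gives the sample mean.
print(v_n(lambda o: o, 1, data))

# Order 2 with a non-symmetric kernel b(o1, o2) = o1 * o2**2.
print(v_n(lambda o1, o2: o1 * o2 ** 2, 2, data))
```

A first order kernel embedded as a second order kernel, for instance `lambda o1, o2: o1`, returns the same value as the first order U-statistic, which illustrates the later remark that a U-statistic of order $s < m$ is also one of order $m$.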
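Any $\mathbb{B}_m$ has a unique (up to permutation) decomposition $\mathbb{B}_m = \sum_{l=1}^{m} \mathbb{D}_l^{(b)}(\theta)$
under any $F(\cdot;\theta)$ as a sum of degenerate U-statistics $\mathbb{D}_l^{(b)}(\theta)$, where degeneracy of
$\mathbb{D}_l^{(b)}(\theta)$ means that $d_l^{(b)}(\theta) \equiv d_l^{(b)}(O_{i_1}, O_{i_2}, \dots, O_{i_l};\theta)$ satisfies

$E_\theta[d_l^{(b)}(o_{i_1}, \dots, o_{i_{s-1}}, O_{i_s}, o_{i_{s+1}}, \dots, o_{i_l};\theta)] = 0$, $s = 1,\dots,l$,

where upper and lower case letters, respectively, denote random variables and their
possible realizations.

Let $\mathcal{U}_m(\theta)$ be the Hilbert space of all U-statistics of order $m$ with mean zero
and finite variance with inner product defined by covariances with respect to the
$n$-fold product measure $F^n(\cdot;\theta)$. Note that any U-statistic $\mathbb{B}_s$ of order $s$, $s < m$, is
also an $m$th order U-statistic with $d_l^{(b)}(\theta)$ identically zero for $m \geq l > s$.

Since any two degenerate U-statistics of different orders are uncorrelated,
$\sum_{l=1}^{s} \mathbb{D}_l^{(b)}(\theta)$ is the $\mathcal{U}_m(\theta)$-Hilbert space projection of $\mathbb{B}_m$ on $\mathcal{U}_s(\theta)$.

Thus a U-statistic $\mathbb{B}_m$ is degenerate $\Leftrightarrow$ $\mathbb{B}_m = \mathbb{D}_m^{(b)}(\theta)$ $\Leftrightarrow$ $\Pi_\theta[\mathbb{B}_m|\mathcal{U}_{m-1}(\theta)] = 0$
$\Leftrightarrow$ $\mathbb{B}_m \in \mathcal{U}_{m-1}(\theta)^{\perp_m}$, where $\Pi_\theta[\cdot|\cdot] \equiv \Pi_{\theta,m}[\cdot|\cdot]$ is the projection operator of the
Hilbert space $\mathcal{U}_m(\theta)$ (with the dependence on $m$ suppressed when no ambiguity
can arise) and, for any linear subspace $\mathcal{R}$ of $\mathcal{U}_m(\theta)$, $\mathcal{R}^{\perp_m}$ is its orthocomplement
in the Hilbert space $\mathcal{U}_m(\theta)$. Given any $\mathbb{B}_m = \mathbb{V}[B_m]$, $\mathbb{D}_m^{(b)}(\theta)$ is explicitly given by
$\mathbb{V}[d_{m,\theta}\{B_m\}]$ where $d_{m,\theta}$ maps $B_m \equiv b(O_{i_1}, O_{i_2}, \dots, O_{i_m})$ to

(2.1) $d_{m,\theta}\{B_m\} = b(O_{i_1}, O_{i_2}, \dots, O_{i_m})
+ \sum_{t=0}^{m-1} (-1)^{m-t} \sum_{\{r_1 < \dots < r_t\} \subseteq \{1,\dots,m\}} E_\theta[b(O_{i_1}, O_{i_2}, \dots, O_{i_m}) | O_{i_{r_1}}, O_{i_{r_2}}, \dots, O_{i_{r_t}}]$.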
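For $m = 2$ the degenerate part of a kernel is the familiar Hoeffding projection $b - E[b|O_1] - E[b|O_2] + E[b]$. The sketch below checks degeneracy exactly for a made-up discrete law and a made-up non-symmetric kernel; `PMF`, `kernel` and `d2` are illustrative names, not the paper's notation.

```python
# Hypothetical discrete law for a single observation O (made-up numbers).
PMF = {0: 0.5, 1: 0.3, 2: 0.2}


def kernel(o1, o2):
    """A deliberately non-symmetric second order kernel b(o1, o2)."""
    return o1 * o2 ** 2 + o1


def e_given_first(o1):
    """E[b(O1, O2) | O1 = o1], with O2 drawn independently from PMF."""
    return sum(pr * kernel(o1, o2) for o2, pr in PMF.items())


def e_given_second(o2):
    """E[b(O1, O2) | O2 = o2], with O1 drawn independently from PMF."""
    return sum(pr * kernel(o1, o2) for o1, pr in PMF.items())


# Unconditional mean E[b(O1, O2)].
MEAN = sum(p1 * p2 * kernel(o1, o2)
           for o1, p1 in PMF.items() for o2, p2 in PMF.items())


def d2(o1, o2):
    """Degenerate second order part: b - E[b|O1] - E[b|O2] + E[b]."""
    return kernel(o1, o2) - e_given_first(o1) - e_given_second(o2) + MEAN


# Degeneracy: integrating out either argument gives zero for every value
# of the other argument.
for fixed in PMF:
    print(sum(pr * d2(fixed, o2) for o2, pr in PMF.items()),
          sum(pr * d2(o1, fixed) for o1, pr in PMF.items()))
```

Each printed pair is zero up to floating point rounding, which is the defining property of a degenerate kernel.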

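Given a function $g(\zeta)$, $\zeta = \{\zeta_1, \dots, \zeta_r\}^T$, define for $m = 0, 1, 2, \dots$,

$g_{\backslash \bar{l}_m}(\zeta) \equiv g_{\backslash l_1 \dots l_m}(\zeta) \equiv \frac{\partial^m g(\zeta)}{\partial\zeta_{l_1} \cdots \partial\zeta_{l_m}}$

with $l_s \in \{1,\dots,r\}$, where the $\backslash$ symbol denotes differentiation by the variables
occurring to its right and the overbar $\bar{l}_m$ denotes the vector $(l_1, \dots, l_m)$. Given a
sufficiently smooth $r$-dimensional parametric submodel $\tilde\theta(\zeta)$ mapping $\zeta \in R^r$
injectively into $\Theta$, define for $\theta$ in the range of $\tilde\theta(\cdot)$,
$\psi_{\backslash \bar{l}_m}(\theta) \equiv (\psi \circ \tilde\theta)_{\backslash \bar{l}_m}(\zeta)|_{\zeta = \tilde\theta^{-1}(\theta)}$
and $f_{\backslash \bar{l}_m}(\mathbf{O}_n;\theta) \equiv (f \circ \tilde\theta)_{\backslash \bar{l}_m}(\mathbf{O}_n;\zeta)|_{\zeta = \tilde\theta^{-1}(\theta)}$.
Here $f(\mathbf{O}_n;\theta) \equiv \prod_{i=1}^{n} f(O_i;\theta)$ is
the density of $\mathbf{O}_n$ with respect to a dominating measure. That is, $\psi_{\backslash \bar{l}_m}(\theta)$ and
$f_{\backslash \bar{l}_m}(\mathbf{O}_n;\theta)$ are higher order derivatives of $\psi(\cdot)$ and $f(\mathbf{O}_n;\cdot)$ under a parametric
submodel $\tilde\theta(\zeta)$, where the model $\tilde\theta$ has been suppressed in the notation. An $s$th
order score associated with the submodel $\tilde\theta(\zeta)$ is defined to be

$\mathbb{S}_{\bar{l}_s}(\theta) \equiv f_{\backslash \bar{l}_s}(\mathbf{O}_n;\theta)/f(\mathbf{O}_n;\theta)$,

where $\mathbb{S}_{\bar{l}_s}(\theta)$ is a U-statistic of order $s$. To understand why $\mathbb{S}_{\bar{l}_s}(\theta)$ is a U-statistic
we provide formulae for an arbitrary score $\mathbb{S}_{\bar{l}_s}(\theta)$ in terms of the subject specific
scores

$S_{l_1 \dots l_s, j}(\theta) \equiv f_{\backslash l_1 \dots l_s, j}(O_j;\theta)/f_j(O_j;\theta)$,

$j = 1, \dots, n$, for $s = 1, 2, 3$. Suppressing the $\theta$-dependence, results in Waterman and
Lindsay [8] imply

$\mathbb{S}_{\bar{l}_1} = \sum_j S_{l_1,j}$,

$\mathbb{S}_{\bar{l}_2} = \sum_j S_{l_1 l_2,j} + \sum_{s \neq j} S_{l_1,j} S_{l_2,s}$,

$\mathbb{S}_{\bar{l}_3} = \sum_j S_{l_1 l_2 l_3,j} + \sum_{s \neq j} (S_{l_1 l_2,j} S_{l_3,s} + S_{l_1 l_3,j} S_{l_2,s} + S_{l_2 l_3,j} S_{l_1,s}) + \sum_{s \neq j \neq t} S_{l_1,j} S_{l_2,s} S_{l_3,t}$.

Note these equations express each $\mathbb{S}_{\bar{l}_s}$ as a sum of degenerate U-statistics.
We now define an $m$th order estimation influence function $\mathbb{IF}_{m,\psi(\cdot)}(\theta) \equiv \mathbb{IF}_{m,\psi}(\theta) \equiv
\mathbb{IF}_m(\theta)$ for $\psi(\theta)$ where we suppress the dependence on $\psi$ when no ambiguity will
arise.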

Definition 2.1. A U-statistic $\mathbb{IF}_m(\theta)$ of order $m$ and finite variance is said to be
an $m$th order estimation influence function for $\psi(\theta)$ if (i) $E_\theta[\mathbb{IF}_m(\theta)] = 0$, $\theta \in \Theta$
and (ii) for $s = 1, 2, \dots, m$ and every suitably smooth and regular (see Appendix)
$r$ dimensional parametric submodel $\tilde\theta(\zeta)$, $r = 1, 2, \dots, m$,

$\psi_{\backslash \bar{l}_s}(\theta) = E_\theta[\mathbb{IF}_m(\theta)\, \mathbb{S}_{\bar{l}_s}(\theta)]$.

Estimation influence functions need not always exist, but when they do they are
useful for deriving point estimators of $\psi$ with small bias and for deriving confidence
interval estimators centered on an estimate of $\psi$. We will generally refer to
estimation influence functions simply as influence functions. We remark that $\mathbb{IF}_m(\theta)$
is an influence function under the above definition if and only if it is one under
the modified version in which the dimension of the parametric submodel $\tilde\theta(\zeta)$ is
unrestricted. A key result is the following theorem which is related to results of
Small and McLeish [20].

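Theorem 2.2 (Extended information equality theorem). Given an $m$th order
influence function $\mathbb{IF}_m(\theta)$, for any smooth, regular submodel $\tilde\theta(\zeta)$ and $s \leq m$,

$\partial^s E_\theta[\mathbb{IF}_m(\tilde\theta(\zeta))] / \partial\zeta_{l_1} \cdots \partial\zeta_{l_s} |_{\zeta = \tilde\theta^{-1}(\theta)} = -\psi_{\backslash \bar{l}_s}(\theta)$.

Thus, if the functionals $E_\theta[\mathbb{IF}_m(\theta^*)]$ and $-[\psi(\theta^*) - \psi(\theta)]$ have bounded Fréchet
derivatives with respect to $\theta^*$ to order $m + 1$ for a norm $\|\cdot\|$,

$E_\theta[\mathbb{IF}_m(\theta + \delta\theta)] = -[\psi(\theta + \delta\theta) - \psi(\theta)] + O(\|\delta\theta\|^{m+1})$

since the functions $E_\theta[\mathbb{IF}_m(\theta^*)]$ and $-[\psi(\theta^*) - \psi(\theta)]$ of $\theta^*$ have the same Taylor
expansion around $\theta$ up to order $m$.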

The proof is in the Appendix. Define the $m$th order tangent space $\Gamma_m(\theta)$ at $\theta$
for the model $\mathcal{M}(\Theta)$ to be the subspace of $\mathcal{U}_m(\theta)$ formed by taking the closed
linear span of all scores of order $m$ or less as we vary over all regular parametric
submodels $\tilde\theta(\zeta)$ (whose range includes $\theta$) of our model $\mathcal{M}(\Theta)$. We say a model is
(locally) nonparametric for $m$th order inference if $\Gamma_m(\theta) = \mathcal{U}_m(\theta)$.

Given any $m$th order estimation influence function $\mathbb{IF}_m(\theta)$, define the $m$th order
efficient estimation influence function to be

$\mathbb{IF}_m^{eff}(\theta) = \Pi_\theta[\mathbb{IF}_m(\theta)|\Gamma_m(\theta)]$,

where $\Pi_\theta[\cdot|\cdot] \equiv \Pi_{\theta,m}[\cdot|\cdot]$ is the $\mathcal{U}_m(\theta)$-projection operator. In the appendix, we
prove the following:

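Theorem 2.3 (Efficient estimation influence function theorem).

1. $\mathbb{IF}_m^{eff}(\theta)$ is unique in the sense that for any two $m$th order influence functions
$\mathbb{IF}_m^{(1)}(\theta)$ and $\mathbb{IF}_m^{(2)}(\theta)$,

$\Pi_\theta[\mathbb{IF}_m^{(1)}(\theta)|\Gamma_m(\theta)] = \Pi_\theta[\mathbb{IF}_m^{(2)}(\theta)|\Gamma_m(\theta)]$.

2. $\mathbb{IF}_m^{eff}(\theta)$ is an $m$th order estimation influence function and has variance less than
or equal to any other $m$th order estimation influence function.

3. $\mathbb{IF}_m(\theta)$ is an $m$th order estimation influence function if and only if

$\mathbb{IF}_m(\theta) \in \{\mathbb{IF}_m^{eff}(\theta) + \mathbb{U}_m(\theta) ; \mathbb{U}_m(\theta) \in \Gamma_m^{\perp_m}(\theta)\}$

where $\Gamma_m^{\perp_m}(\theta)$ is the ortho-complement of $\Gamma_m(\theta)$ in $\mathcal{U}_m(\theta)$.

4. If $\mathbb{IF}_m(\theta)$ exists then $\mathbb{IF}_s^{eff}(\theta)$ exists for $s < m$ and $\Pi_\theta[\mathbb{IF}_m(\theta)|\Gamma_s(\theta)] =
\mathbb{IF}_s^{eff}(\theta)$.

5. If the model $\mathcal{M}(\Theta)$ is (locally) nonparametric, then

(a) there is at most one $m$th order estimation influence function $\mathbb{IF}_m(\theta)$ for
$\psi(\theta)$.

(b)

$\mathbb{IF}_m(\theta) = \mathbb{IF}_{m-1}(\theta) + \mathbb{IF}_{mm}(\theta)$

where

$\mathbb{IF}_{m-1}(\theta) = \Pi_{m,\theta}[\mathbb{IF}_m(\theta)|\mathcal{U}_{m-1}(\theta)]$

and $\mathbb{IF}_{mm}(\theta)$ is a degenerate $m$th order U-statistic and thus

$E_\theta[\mathbb{IF}_{m-1}(\theta)\, \mathbb{IF}_{mm}(\theta)] = 0$.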

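(c) (i) Suppose, for a given $m \geq 2$, $\mathbb{IF}_{m-1}(\theta)$ exists and a kernel
$if_{m-1,m-1}(o_{i_1}, \dots, o_{i_{m-1}};\theta)$ of $\mathbb{IF}_{m-1,m-1}(\theta)$ has a first order influence
function with kernel $if_{1, if_{m-1,m-1}(o_{i_1}, \dots, o_{i_{m-1}};\cdot)}(O_{i_m};\theta)$ for all
$o_{i_1}, \dots, o_{i_{m-1}}$ in a set $\mathcal{O}_{m-1}$ which has probability 1 under $F^{(m-1)}(\cdot;\theta)$. Then
$\mathbb{IF}_m(\theta)$ exists and

(2.2) $m\, \mathbb{IF}_{m,m}(\theta) = \mathbb{V}(d_{m,\theta}[if_{1, if_{m-1,m-1}(O_{i_1}, \dots, O_{i_{m-1}};\cdot)}(O_{i_m};\theta)])$

where the operator $d_{m,\theta}$ is given in Equation (2.1).

(ii) Conversely, if $\mathbb{IF}_m$ exists then the symmetric kernel $if^{(s)}_{m-1,m-1}(o_{i_1}, \dots, o_{i_{m-1}};\theta)$
of $\mathbb{IF}_{m-1,m-1}(\theta)$ has a first order influence function for all $o_{i_1}, \dots, o_{i_{m-1}}$
in a set $\mathcal{O}_{m-1}$ which has probability 1 under $F^{(m-1)}(\cdot;\theta)$. Further

$m^{-1} d_{m,\theta}[if_{1, if^{(s)}_{m-1,m-1}(O_{i_1}, \dots, O_{i_{m-1}};\cdot)}(O_{i_m};\theta)] = if_{m,m}(O_{i_1}, \dots, O_{i_m};\theta)$.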
Remark 2.4. Pfanzagl [11] previously proved part 5.c(i) for $m = 2$. Our theorem
offers a generalization of his result. Note, in part (i) of 5(c), we can always take the
kernel to be the symmetric kernel.

Remark 2.5. Provided one knows how to calculate first order influence functions,
one can obtain $\mathbb{IF}_2(\theta), \dots, \mathbb{IF}_m(\theta)$ recursively using part (5.c). An example of such
a calculation is given in Section 3.2.2 below. Thus part (5.c) has the interesting
implication that even though higher order influence functions are defined in terms
of their inner products with higher order scores $\mathbb{S}_{\bar{l}_m}$, nevertheless, in (locally)
nonparametric models, one can derive all the higher order influence functions of a
functional $\psi(\theta)$ without even knowing how to compute the scores $\mathbb{S}_{\bar{l}_m}$ for any
$m > 1$. In fact, one need not even be aware of the structure of the scores $\mathbb{S}_{\bar{l}_m}$ in
terms of the subject-specific higher order scores $S_{l_1 \dots l_m, j}(\theta)$. In contrast, in
parametric or semiparametric models whose tangent space $\Gamma_m(\theta)$ does not equal the
set $\mathcal{U}_m(\theta)$ of all $m$th order U-statistics, one can often (but not always) still obtain
an inefficient influence function $\mathbb{IF}_m(\theta)$ by applying part (5.c) of the Theorem. However,
calculation of the efficient influence function $\mathbb{IF}_m^{eff}(\theta) = \Pi_\theta[\mathbb{IF}_m(\theta)|\Gamma_m(\theta)]$ by
projection generally requires explicit knowledge of the scores $\mathbb{S}_{\bar{l}_m}$ to derive $\Gamma_m(\theta)$.
For this reason, it can be considerably more difficult to analyze certain parametric
models (with dimension increasing with sample size) than to analyze (locally)
nonparametric models. We will consider derivation of and projections onto $\Gamma_m(\theta)$ in a
forthcoming paper. In the current paper, however, we do calculate $\mathbb{IF}_2^{eff}(\theta)$ in one
model that is not (locally) nonparametric so as to provide some sense of the issues
that arise. Specifically in Section 4, we calculate $\mathbb{IF}_2^{eff}(\theta)$ for a truncated version
of the functional $E[\{E[Y|X]\}^2]$ in a model that assumes the marginal distribution
of $X$ is known.



Remark 2.6 (Implications of Theorem 2.3 for the variance of unbiased estimators).
Suppose we have $n$ iid draws $\mathbf{O} = (O_1, \dots, O_n)$ from $F(o;\theta)$, $\theta \in \Theta$, and a U-statistic
$\hat\psi_m$ of order $m < n$ with $\mathrm{var}_\theta[\hat\psi_m] < \infty$ for $\theta \in \Theta$ satisfying $E_\theta[\hat\psi_m] = \psi(\theta)$
for all $\theta \in \Theta$. That is, $\hat\psi_m$ is unbiased for $\psi(\theta)$. We will use Theorem (2.3) to
generalize a number of well-known results on minimum variance unbiased estimation
to arbitrary models.

By $E_\theta[\hat\psi_m] = \psi(\theta)$, we immediately conclude that, viewing $\hat\psi_m$ as a $k$th order
U-statistic, $\hat\psi_m - \psi(\theta)$ is a $k$th order estimation influence function for $\psi(\theta)$ for $n \geq k \geq
m$. By Theorem 2.3, $\mathrm{var}_\theta[\hat\psi_m] \geq \mathrm{var}_\theta[\mathbb{IF}_k^{eff}(\theta)]$. We refer to $\mathrm{var}_\theta[\mathbb{IF}_m^{eff}(\theta)]$ as
the $m$th order Bhattacharyya variance bound at $\theta$ for the parameter $\psi(\theta)$ in model
$\mathcal{M}(\Theta)$, as this definition, in a precise analogy to Bickel et al. [3]'s generalization of
the Cramer-Rao variance bound, generalizes Bhattacharyya's [2] variance bound to
arbitrary semi- and non-parametric models. Indeed our first order Bhattacharyya
bound is precisely Bickel et al.'s [3] generalization of the Cramer-Rao variance
bound.

We shall refer to an $m$th order U-statistic estimator $\hat\psi_m$ as $m$th order "unbiased
locally efficient" at $\theta^*$ for $\psi(\theta)$ in model $\mathcal{M}(\Theta)$ if it is unbiased for $\psi(\theta)$ under the
model with variance at $\theta^*$ equal to the $m$th order Bhattacharyya bound at $\theta^*$. If
$\hat\psi_m$ is "unbiased locally efficient" at $\theta^*$ for all $\theta^* \in \Theta$, we say it is 'unbiased globally
efficient'. By Theorem 2.3, $\mathrm{var}_\theta[\mathbb{IF}_k^{eff}(\theta)] \geq \mathrm{var}_\theta[\mathbb{IF}_m^{eff}(\theta)]$ for $n \geq k \geq m$. As
a consequence if an $m$th order 'unbiased locally efficient' estimator $\hat\psi_{m,eff}$ exists
at $\theta^*$ then, for $n \geq k > m$, $\mathbb{IF}_k^{eff}(\theta^*) = \mathbb{IF}_m^{eff}(\theta^*)$ so the $m$th and $k$th order
Bhattacharyya bounds are equal at $\theta^*$ and $\hat\psi_{m,eff}$ is also $k$th order 'unbiased locally
efficient' at $\theta^*$.

From the fact that for an unbiased estimator $\hat\psi_m$, $\hat\psi_m - \psi(\theta)$ is an $m$th order
influence function, we conclude that the variance of $\hat\psi_m$ attains the bound
$\mathrm{var}_{\theta^*}[\mathbb{IF}_m^{eff}(\theta^*)]$ at $\theta^*$ if and only if $\hat\psi_m - \psi(\theta^*) = \mathbb{IF}_m^{eff}(\theta^*)$. It follows that $\hat\psi_m$
is 'unbiased globally efficient' if and only if $\hat\psi_m - \psi(\theta) = \mathbb{IF}_m^{eff}(\theta)$ for all $\theta \in \Theta$.
We thus have proved the following theorem in the $\Rightarrow$ direction. The $\Leftarrow$ direction is
immediate.

Theorem 2.7. In a model $\mathcal{M}(\Theta)$, there exists an $m$th order unbiased globally
efficient U-statistic estimator of $\psi(\theta)$, if and only if, for all $\theta \in \Theta$, $\mathbb{IF}_m^{eff}(\theta) + \psi(\theta)$
is a function $\hat\psi_{m,eff}$ of the data $\mathbf{O}$, not depending on $\theta$. In that case, $\hat\psi_{m,eff}$ is the
unique unbiased globally efficient estimator.

In a locally nonparametric model all unbiased $m$th order estimators are unbiased
globally efficient, as there is a unique $m$th order influence function. For example, the
usual unbiased estimator $\hat\sigma^2 = \sum_{i=1}^{n} \left(X_i - n^{-1}\sum_{j=1}^{n} X_j\right)^2 / (n-1)$ of the variance
of a random variable $X$ is a second order U-statistic and thus is a $k$th order unbiased
globally efficient U-statistic for $k \geq 2$ in the locally nonparametric model consisting
of all distributions under which $X$ has a finite variance.
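The claim that the usual unbiased variance estimator is a second order U-statistic can be checked numerically. The sketch below assumes NumPy; the function names are illustrative and not from the paper. It compares the textbook formula with the average of the symmetric kernel $(x_i - x_j)^2/2$ over all ordered pairs $i \neq j$.

```python
import numpy as np

def sample_variance(x):
    """Usual unbiased estimator: sum_i (x_i - mean)^2 / (n - 1)."""
    x = np.asarray(x, dtype=float)
    return np.sum((x - x.mean()) ** 2) / (len(x) - 1)

def variance_u_statistic(x):
    """Second order U-statistic with kernel h(x_i, x_j) = (x_i - x_j)^2 / 2."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    diffs = x[:, None] - x[None, :]
    # The diagonal (i == j) contributes zero, so summing every entry equals
    # summing over the n * (n - 1) ordered pairs with i != j.
    return np.sum(diffs ** 2 / 2.0) / (n * (n - 1))

rng = np.random.default_rng(0)
x = rng.normal(size=50)
print(sample_variance(x), variance_u_statistic(x))
```

The two numbers agree up to floating point error, which is the algebraic identity $\sum_{i \neq j} (x_i - x_j)^2 / 2 = n \sum_i (x_i - \bar x)^2$.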

In Section 4 we use the results from this remark to compare the relative efficiencies
of competing rate-optimal unbiased estimators in a model which is not locally
nonparametric.

We now describe the main heuristic idea behind using higher order influence
functions. Technical details are suppressed. Consider the estimator

(2.3) $\hat\psi_m = \psi(\hat\theta) + \mathbb{IF}^{eff}_{m,\psi}(\hat\theta)$

based on a sample size $n$, where $\hat\theta$ is an initial rate optimal estimator of $\theta$ from a
separate independent training sample. That is, we assume that our actual sample
size is $N$ and we randomly split the $N$ observations into two samples: an analysis
sample of size $n$ and a training sample of size $N - n$ where $(N - n)/N = c^*$,
$1 > c^* > 0$. We obtain our initial estimate $\hat\theta$ from the training sample data.
Sample splitting has no effect on optimal rates of convergence, although in the
form described here it does affect 'constants'. Throughout the paper, we derive the
properties of our estimators conditional on the data in the training sample. In a
later section, we describe how one can sometimes obtain an optimal constant by
choosing $(N - n)/N \sim N^{-\epsilon}$, $\epsilon > 0$, rather than $c^*$.
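The random split into a training sample and an analysis sample can be sketched as follows. This is a minimal illustration assuming NumPy; `split_sample` and `train_fraction` are hypothetical names, with `train_fraction` playing the role of $c^* = (N - n)/N$.

```python
import numpy as np

def split_sample(data, train_fraction, seed=0):
    """Randomly split N observations into a training and an analysis sample."""
    data = np.asarray(data)
    N = len(data)
    n_train = int(round(train_fraction * N))        # N - n in the text
    perm = np.random.default_rng(seed).permutation(N)
    train = data[perm[:n_train]]                    # used for the initial estimate
    analysis = data[perm[n_train:]]                 # used to evaluate the U-statistics
    return train, analysis

obs = np.arange(1000)
train, analysis = split_sample(obs, train_fraction=0.3)
print(len(train), len(analysis))
```

All later properties are then derived conditionally on `train`, so the initial estimate can be treated as fixed when working with `analysis`.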

Remark 2.8. Note that sample splitting is avoided in most statistical applications
by using modern "empirical process theory" to prove that 'plug-in' estimators such
as $\hat\psi_m = \psi(\hat\theta) + \mathbb{IF}^{eff}_{m,\psi}(\hat\theta)$ that estimate $\theta$ from the same sample used to
calculate $\mathbb{IF}^{eff}_{m,\psi}(\cdot)$ have nice statistical properties. However empirical process theory
is not applicable in our setting because we are interested in function classes whose
size (entropy) is so large that they fail to be Donsker. For this reason we initially
believed that explicit sample splitting would be difficult to avoid in our methodology.
However, in Robins et al. [18], we describe a new method that effectively
allows one to use all the data for estimator construction.



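Expanding and evaluating conditionally on the training sample (or equivalently
on $\hat\theta$), we find by Theorem 2.2 that the conditional bias

$$E_\theta\left[\hat\psi_m - \psi(\theta) \,\middle|\, \hat\theta\right] = \psi(\hat\theta) - \psi(\theta) + E_\theta\left[\mathbb{IF}^{eff}_{m,\psi}(\hat\theta) \,\middle|\, \hat\theta\right]$$

is $O_p\left(\|\hat\theta - \theta\|^{m+1}\right)$, which decreases with $m$ provided $\|\hat\theta - \theta\| < 1$.
In Theorem 3.22 below, we show that if

$$\sup_{o \in \mathcal{O}} \left| \frac{f(o;\hat\theta) - f(o;\theta)}{f(o;\theta)} \right| \to 0$$

as $\|\hat\theta - \theta\| \to 0$, where $f(o;\theta)$ is the density of $O$ under $\theta$ and $\mathcal{O}$ has probability
one under all $\theta \in \Theta$, then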



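$$\mathrm{var}_\theta\left[\hat\psi_m \,\middle|\, \hat\theta\right] = \mathrm{var}_\theta\left[\mathbb{IF}^{eff}_{m,\psi}(\hat\theta) \,\middle|\, \hat\theta\right] = \mathrm{var}_{\hat\theta}\left[\mathbb{IF}^{eff}_{m,\psi}(\hat\theta)\right]\left(1 + O_p(\|\hat\theta - \theta\|)\right).$$

Now, by Theorem 2.3, $\mathrm{var}_{\hat\theta}\left[\mathbb{IF}^{eff}_{m,\psi}(\hat\theta)\right]$ increases with $m$. Further,
$\mathrm{var}_{\hat\theta}\left[\mathbb{IF}^{eff}_{1,\psi}(\hat\theta)\right] \asymp 1/n$, since, conditional on $\hat\theta$, $\mathbb{IF}^{eff}_{1,\psi}(\hat\theta)$ is the sample average of
iid random variables.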

To proceed further we shall need to be more explicit about the model $\mathcal{M}(\Theta)$.
For now, we consider finite-dimensional parametric models whose dimension $k(n)$
increases with sample size. That is, $\theta = \theta_n$ depends on $n$ and the dimension of
$\Theta = \Theta_n$ is $k(n)$. Suppose $k(n) \asymp n^{\gamma}$, $\gamma > 0$. Let $\hat\theta_n$ be the maximum likelihood
estimator of $\theta$. If $k(n)$ increases slower than the sample size (i.e., $\gamma < 1$), then,
a) under regularity conditions, $\|\hat\theta_n - \theta_n\| = O_p\left(\{k(n)/n\}^{1/2}\right) = O_p\left(n^{-(1-\gamma)/2}\right)$
with $\|\cdot\|$ the usual Euclidean norm in $R^{k(n)}$; and b) $\mathrm{var}_{\hat\theta}\left[\mathbb{IF}^{eff}_{m,\psi}(\hat\theta)\right]$, although
increasing with $m$, remains order $1/n$; as a consequence, if $m$ is chosen greater
than the solution $m^*$ to $n^{-(m+1)(1-\gamma)/2} = n^{-1/2}$, the bias of $\hat\psi_m$ will be $o_p(n^{-1/2})$,
the rate of convergence will be the usual parametric rate of $n^{-1/2}$, and thus, for
$n$ sufficiently large, the squared bias of $\hat\psi_m$ will be less than the variance. As a
consequence, as discussed in Section 3.2.5, we can construct honest (i.e. uniform
over $\theta_n \in \Theta_n$) asymptotic confidence intervals centered at $\hat\psi_{m^*}$ with width of order
$n^{-1/2}$. Here is a concrete example.
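The choice of $m$ can be made concrete with a little arithmetic. The sketch below assumes the rates stated in the text, namely a bias of order $n^{-(m+1)(1-\gamma)/2}$, so that the bias is of smaller order than $n^{-1/2}$ exactly when $(m+1)(1-\gamma) > 1$. The function name is illustrative.

```python
import math

def smallest_order_for_root_n(gamma):
    """Smallest integer m with (m + 1) * (1 - gamma) > 1, for 0 <= gamma < 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    m_star = gamma / (1.0 - gamma)    # solves (m + 1) * (1 - gamma) = 1
    return math.floor(m_star) + 1     # smallest integer strictly above m_star

for gamma in (0.25, 0.5, 0.75):
    print(gamma, smallest_order_for_root_n(gamma))
```

The required order grows without bound as $\gamma$ approaches 1, which matches the remark that the variance stays of order $1/n$ only while $k(n)$ grows slower than $n$.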

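Example. Suppose $O = (Y, X)$ with $Y$ Bernoulli and with $X$ having a density
with respect to the uniform measure $\mu(\cdot)$ on the unit cube $[0,1]^d$ in $R^d$. Suppose
$\psi(\theta) = E_\theta\left[\{E_\theta(Y|X)\}^2\right]$. Let $\{z_l(\cdot)\} = \{z_l(x);\ l = 1, 2, \ldots\}$ be a countable, linearly
independent sequence of either spline, polynomial, or compact wavelet basis functions
dense in $L_2(\mu)$. Set $\bar z_k(x) = (z_1(x), \ldots, z_k(x))^T$. We assume

$$E(Y|X = x) \in \left\{ \frac{1}{1 + \exp\left(-\eta_{k^*(n)}^T \bar z_{k^*(n)}(x)\right)} ;\ \eta_{k^*(n)} \in \mathcal{N}_{k^*(n)} \right\},$$

$$f_X(x) \in \left\{ f(x;\omega_{k^{**}(n)}) = c(\omega_{k^{**}(n)}) \exp\left(\omega_{k^{**}(n)}^T \bar z_{k^{**}(n)}(x)\right) ;\ \omega_{k^{**}(n)} \in \mathcal{W}_{k^{**}(n)} \right\},$$

where $c(\omega_{k^{**}(n)})$ is a normalizing constant and $\mathcal{N}_{k^*(n)}$ and $\mathcal{W}_{k^{**}(n)}$ are open
bounded subsets of $R^{k^*(n)}$ and $R^{k^{**}(n)}$. Hence, $\Theta_n = \mathcal{N}_{k^*(n)} \times \mathcal{W}_{k^{**}(n)}$ has dimension
$k(n) = k^*(n) + k^{**}(n)$ and $\psi(\theta) = \psi_n \equiv \int b^2(x) f(x;\omega_{k^{**}(n)})\, d\mu(x)$ with $b(x) = E(Y|X = x)$.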



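He [7] and Portnoy [12] prove that, under regularity conditions, $\|\hat\theta_n - \theta_n\| = O_p\left(\{k(n)/n\}^{1/2}\right)$
when $k(n) = n^{\gamma} \ll n$. Below we shall see that
$\mathrm{var}_{\hat\theta}\left[\mathbb{IF}^{eff}_{m,\psi}(\hat\theta)\right] \asymp 1/n$ for $n^{\gamma} \ll n$.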

Consider next models whose dimension $k(n) \asymp n^{\gamma}$ increases faster than $n$ (i.e.,
$\gamma > 1$). In such models, the MLE $\hat\theta_n$ is generally inconsistent and indeed there may
exist no consistent estimator of $\theta_n$. In that case, $\|\hat\theta_n - \theta_n\|$ fails to be $o_p(1)$ and the
conditional bias $E_\theta\left[\hat\psi_m - \psi(\theta) \,\middle|\, \hat\theta\right]$ may not decrease with $m$. In order to guarantee
consistent estimators of $\theta_n$ exist, it is necessary to place further a priori restrictions
on the complexity of $\Theta_n$. Typical examples of complexity-reducing assumptions
would be an $\epsilon$-sparseness assumption that only $k(n)^{\epsilon}$, $0 < \epsilon < 1$, of the $k(n)$
parameters are non-zero or a smoothness assumption that specifies that the rate
of decrease of the $j$th component of $\theta_n$ is equal to $1/j$ raised to a given (positive)
power. Even after imposing such complexity-reducing assumptions, $\psi(\theta) = \psi(\theta_n)$
may not be estimable at rate $n^{-1/2}$.

For instance consider the previous example but now with 7* and 7** exceeding 



1, so k** (n) ~ rC 3> n, k* (n) ^ rC ^ n and k (n) 
n with 7 ~ max (7**, 7**). Consider the norms 



f {x;uJk"(n)Y (x) 



k** {n) + k* (n) x n'^ > 



1/2 



i/p 



and 



rfc"(»)llp- 



Suppose, under a particular smoothness assumption, optimal rate estimators 77^,, 



(") 



and ujk"{n) of 77a..(„) and Wfc"(„) satisfy 



Op (n-T") and 



e^e\\p^Op{ 

expect that var- 



for some 7,, > 0,7^^ > and all p > 2. Hence, 
116* — 6'llp ~ Op (max{n^''''', n"'''^}). For 7 > 1, based on arguments given later, we 



Ea 



i,{e)\ 



Op {n-^- 



and 



Vk'[n) ^ Vk*{n) 



^k"{n) — '^k-"{n) 



Op (^n-27.-(™-i)7.j 



= Op[\\e-e\\zt\ 



To find the estimator $\hat\psi_{m_{opt}}$ in the class $\hat\psi_m$ with optimal rate of convergence,
let $m^* = 1 + \frac{1 - 4\gamma_\eta}{\gamma - 1 + 2\gamma_\omega}$ be the value of $m$ that equates the order $n^{-4\gamma_\eta - 2(m-1)\gamma_\omega}$
of the squared bias and the order $n^{(\gamma-1)(m-1)-1}$ of the variance. Then $m_{opt} = \lfloor m^* \rfloor$ if
the order $n^{-4\gamma_\eta - 2(m-1)\gamma_\omega} + n^{(\gamma-1)(m-1)-1}$ of the mean squared error at $m = \lfloor m^* \rfloor$ is less
than or equal to that at $m = \lceil m^* \rceil$. Otherwise, $m_{opt} = \lceil m^* \rceil$. The rate of convergence of
$\hat\psi_{m_{opt}}$ will often be slower than $n^{-1/2}$. Note $m_{opt} = 1$ whenever $\gamma > 2$, regardless
of $\gamma_\eta$ and $\gamma_\omega$.

By using the estimator $\hat\psi_{\lceil m^* \rceil}$ rather than $\hat\psi_{m_{opt}}$, we can guarantee that the
variance asymptotically dominates bias and construct honest (i.e. uniform over $\theta_n \in
\Theta_n$) asymptotic confidence intervals centered at $\hat\psi_{\lceil m^* \rceil}$. Of course, the sample size
$n$ at which, for all $\theta_n \in \Theta_n$, the finite sample coverage of the intervals discussed
above is close to the asymptotic (i.e. nominal) coverage is generally unknown and
could be very large. For this reason, a better, but unfortunately as yet technically
out of reach, approach to confidence interval construction is discussed in Section
3.2.5.

In contrast to the case of parametric models of increasing dimension, in the infinite
dimensional models which we consider in the following section, the functionals
$\psi(\theta)$ of interest have first order influence functions $\mathbb{IF}_{1}(\theta)$ but do not have higher
order influence functions. As a consequence, an initial 'truncation' step is needed
before we can apply the approach outlined in the preceding paragraph.

Finally, even in the case of parametric models with $k(n) \gg n$ and complexity
reducing assumptions imposed, when the minimax rate for estimation of $\psi(\theta)$ is
slower than $n^{-1/2}$, the optimal estimator $\hat\psi_{m_{opt}}$ in the class $\hat\psi_m$ will generally not
be rate minimax. See Section 3.2.6 and Section 4.1.1 for additional discussion.



3. Inference for a class of doubly robust functionals 
3.1. The class of functionals

In this Section we consider models in which the parameter $\theta$ is infinite dimensional
and $\theta$, $\Theta$, and $\psi(\cdot)$ do not depend on $n$. We make the following three assumptions
(Ai)-(Aiii):

(Ai) The data $O$ includes a vector $X$, where, for all $\theta \in \Theta$, the distribution of
$X$ is supported on the unit cube $[0,1]^d$ (or more generally a compact set) in $R^d$ and
has a density $f(x)$ with respect to the Lebesgue measure. Further $\Theta = \Theta_1 \times \Theta_2$
where $\theta_1 \in \Theta_1$ governs the marginal law of $X$ and $\theta_2 \in \Theta_2$ governs the conditional
distribution of $O|X$.

(Aii) The parameter $\theta_2$ contains components $b = b(\cdot)$ and $p = p(\cdot)$, $b : [0,1]^d \to
R$ and $p : [0,1]^d \to R$, such that the functional $\psi(\theta)$ of interest has a first order
influence function $\mathbb{IF}_{1,\psi}(\theta) = \mathbb{V}\left[IF_{1,\psi}(\theta)\right]$, where

(3.1) $IF_{1,\psi}(\theta) = H(b,p) - \psi(\theta)$,
with $H(b,p) \equiv h(O, b(X), p(X))$

(3.2) $= b(X)p(X)h_1(O) + b(X)h_2(O) + p(X)h_3(O) + h_4(O)$

$\equiv BPH_1 + BH_2 + PH_3 + H_4$,

and the known functions $h_1(\cdot)$, $h_2(\cdot)$, $h_3(\cdot)$, $h_4(\cdot)$ do not depend on $\theta$.

(Aiii) (a) $\Theta_{2b} \times \Theta_{2p} \subseteq \Theta_2$ where $\Theta_{2b}$ and $\Theta_{2p}$ are the parameter spaces for
the functions $b$ and $p$. Furthermore the sets $\Theta_{2b}$ and $\Theta_{2p}$ are dense in $L_2(F_X(x))$
at each $\theta_1 \in \Theta_1$.

or

(b) $b(\cdot) = p(\cdot)$, $h_3(O) = h_2(O)$ w.p.1, and $\Theta_{2b} \subseteq \Theta_2$ is dense in $L_2(F_X(x))$
at each $\theta_1 \in \Theta_1$.

Remark 3.1. (Aiiib) can be viewed as a special case of (Aiiia) as discussed in
Example 1a below, so we need only prove results under assumption (Aiiia).

Assumptions (Ai)-(Aiii) have a number of important implications that we sum- 
marize in a Theorem and two Lemmas. 

Theorem 3.2 (Double-robustness). Assume (Ai)-(Aiii) hold, and recall $p$ and $b$
are elements of $\theta$. Then

$$E_\theta[H(b,p^*)] = E_\theta[H(b^*,p)] = E_\theta[H(b,p)] = \psi(\theta)$$

for all $(p^*, b^*) \in \Theta_{2p} \times \Theta_{2b}$, $\theta \in \Theta$.

Proof. $E_\theta[H(b^*,p)] - E_\theta[H(b,p)] = E_\theta[\{H_1 p(X) + H_2\}\{b^*(X) - b(X)\}]$ and
$E_\theta[H(b,p^*)] - E_\theta[H(b,p)] = E_\theta[\{H_1 b(X) + H_3\}\{p^*(X) - p(X)\}]$. Both right hand sides
are zero by part 1) of the following lemma, after conditioning on $X$, and the theorem follows. □

Theorem 3.2 states that $H(\cdot,\cdot)$ has mean $\psi(\theta)$ under $F(\cdot;\theta)$ even when $p$ is
misspecified as $p^*$ or $b$ is misspecified as $b^*$. We refer to the functional $\psi(\theta)$ as
doubly robust because of this property. The next lemma shows that $H(b^*,p^*)$ is not
unbiased if both $b$ and $p$ are simultaneously misspecified. That is, $E_\theta[H(b^*,p^*)] \neq
\psi(\theta)$.

Lemma 3.3. Assume (Ai)-(Aiii) hold. Then

1. $E_\theta[\{H_1 B + H_3\}|X] = E_\theta[\{H_1 P + H_2\}|X] = 0$

2. $E_\theta[H(b^*,p^*)] - E_\theta[H(b,p)] = E_\theta[(B - B^*)(P - P^*)H_1]$

and $\psi(\theta) = E_\theta[H(b,p)] = E_\theta[-BPH_1 + H_4]$.

Proof. Part (1): By assumptions (Ai) and (Aiiia) we have paths $\theta_l(t)$, $l = 1, 2, \ldots$,
in our model with $\theta_l(0) = \theta$ and $p_l(t) = p_l(x;t) = p(x) + t c_l(x)$, $b_l(x;t) =
b(x)$, $F_l(x;t) = F(x)$ for $l = 1, 2, \ldots$, where the sequence $c_l(\cdot)$ is dense in
$L_2[F_X(x)]$. Let $S_l(\theta)$ be the score for path $\theta_l(t)$ at $t = 0$. Then by $\psi(\theta_l(t)) =
E_{\theta_l(t)}[H(b, p_l(t))]$,

$$d\psi(\theta_l(t))/dt|_{t=0} = E_\theta[\{H_1 B + H_3\} c_l(X)] + E_\theta[H(b,p) S_l(\theta)].$$

By $IF_{1,\psi}(\theta) = H(b,p) - \psi(\theta)$,

$$d\psi(\theta_l(t))/dt|_{t=0} = E_\theta[H(b,p) S_l].$$

Thus $E[\{H_1 B + H_3\} c_l(X)] = 0$. But $\{c_l(\cdot)\}$ is dense in $L_2[F_X(X)]$ so

$$E[H_1 B + H_3 | X] = 0.$$

An analogous argument proves $E_\theta[\{H_1 P + H_2\}|X] = 0$. Part (2): $E_\theta[H(b^*,p^*)] -
E_\theta[H(b,p)]$

$$= E_\theta[(B^*P^* - BP)H_1 + (B^* - B)H_2 + (P^* - P)H_3]$$
$$= E_\theta[(B^*P^* - BP)H_1 - (B^* - B)PH_1 - (P^* - P)BH_1]$$
$$= E_\theta[(B - B^*)(P - P^*)H_1],$$

where the second equality is by part 1). Choosing $P^* = B^* = 0$ w.p.1 completes
the proof of the theorem since then $E_\theta[H(b^*,p^*)] = E_\theta[H_4]$. □

Below we will need the following partial converse to Lemma 3.3.

Lemma 3.4. Let $\Theta_{2b}$, $\Theta_{2p}$, $\Theta_1$ and $\Theta$ and $H(b,p)$ be as defined in (Ai)-(Aiiia).
Suppose that

$$E_\theta[\{H_1 B + H_3\}|X] = E_\theta[\{H_1 P + H_2\}|X] = 0$$

and $\psi(\theta) = E_\theta[H(b,p)]$. Then $\mathbb{V}[H(b,p) - \psi(\theta)]$ is the first order influence function
of $\psi(\theta)$.

Proof. The influence function of the functional $E_\theta[H(b^*,p^*)]$ for known functions
$b^*, p^*$ is $\mathbb{V}[H(b^*,p^*) - E_\theta[H(b^*,p^*)]]$. Thus by the linearity of first order influence
functions, the Lemma is true if and only if for each $\theta_0 \in \Theta$, the functional
$T(b,p) = E_{\theta_0}[H(b,p)]$ with $\theta_0$ fixed has influence function equal to 0 w.p.1 at
$(b,p) = (b_0, p_0) \subset \theta_0$. That the influence function is equal to 0 follows from the fact
that, under the assumptions of the Lemma, for sets $\{c_l(\cdot)\}$ and $\{d_l(\cdot)\}$ dense in
$L_2[F_0(X)]$,

$$dE_{\theta_0}[H(b_0(X) + t c_l(X), p_0(X) + t d_l(X))]/dt|_{t=0}$$
$$= E_{\theta_0}[\{H_1 b_0(X) + H_3\} d_l(X)] + E_{\theta_0}[\{H_1 p_0(X) + H_2\} c_l(X)] = 0.$$ □

Results of Ritov and Bickel [14] and Robins and Ritov [15] imply it is not possible
to construct honest asymptotic confidence intervals for $\psi(\theta)$ whose width shrinks
to 0 as $n \to \infty$ if $b(\cdot)$ and $p(\cdot)$ are too rough. Therefore we also place a priori
bounds on their roughness. Our bounds will be based on the following definition.

Definition 3.5. A function $h(\cdot)$ with domain $[0,1]^d$ is said to belong to a Hölder
ball $H(\beta, C)$, with Hölder exponent $\beta > 0$ and radius $C > 0$, if and only if $h(\cdot)$ is
uniformly bounded by $C$, all partial derivatives of $h(\cdot)$ up to order $\lfloor\beta\rfloor$ exist and
are bounded, and all partial derivatives $\nabla^{\lfloor\beta\rfloor}$ of order $\lfloor\beta\rfloor$ satisfy

$$\sup_{x,\, x+\delta x \in [0,1]^d} \left| \nabla^{\lfloor\beta\rfloor} h(x + \delta x) - \nabla^{\lfloor\beta\rfloor} h(x) \right| \leq C \|\delta x\|^{\beta - \lfloor\beta\rfloor}.$$

We note that the $L_p$, $2 \leq p < \infty$, and $L_\infty$ rates of convergence for estimation of
a marginal density or conditional expectation $h(\cdot) \in H(\beta, C)$ are $O\left(n^{-\beta/(2\beta+d)}\right)$ and
$O\left((n/\log n)^{-\beta/(2\beta+d)}\right)$ respectively. We refer to an estimator attaining these rates as
rate optimal.
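To get a feel for how slowly these rates improve with $n$ when the dimension $d$ is large relative to the smoothness $\beta$, the following sketch evaluates the standard rates $n^{-\beta/(2\beta+d)}$ and $(n/\log n)^{-\beta/(2\beta+d)}$. The function names are illustrative, and constants are ignored.

```python
import math

def holder_rate_exponent(beta, d):
    """Exponent beta / (2 * beta + d) appearing in the optimal rates."""
    if beta <= 0 or d < 1:
        raise ValueError("need beta > 0 and d >= 1")
    return beta / (2.0 * beta + d)

def lp_rate(n, beta, d):
    """Order of the optimal L_p rate, 2 <= p < infinity."""
    return n ** (-holder_rate_exponent(beta, d))

def sup_norm_rate(n, beta, d):
    """Order of the optimal L_infinity rate, slower by a log factor."""
    return (n / math.log(n)) ** (-holder_rate_exponent(beta, d))

for d in (1, 5, 20):
    print(d, lp_rate(10_000, beta=2.0, d=d), sup_norm_rate(10_000, beta=2.0, d=d))
```

For fixed $\beta$ the exponent shrinks toward zero as $d$ grows, which is the curse of dimensionality referred to in the abstract.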

We make the following fourth assumption:

(Aiv) We assume $b(\cdot)$, $p(\cdot)$, and $g(\cdot)$ lie in given Hölder balls $H(\beta_b, C_b)$, $H(\beta_p,
C_p)$, $H(\beta_g, C_g)$ where

(3.3) $g(x) \equiv E\{H_1|X = x\} f(x)$.

Furthermore we assume $g(X) > \sigma_g > 0$ w.p.1. Finally we assume, as can always
be arranged by a suitable choice of estimator, that the initial training sample estimators
$\hat b(\cdot)$, $\hat p(\cdot)$, and $\hat g(\cdot)$ are rate optimal, have more than $\max\{\beta_b, \beta_g, \beta_p\}$ derivatives,
and have $L_\infty$ norm bounded by a constant $C_\infty$. Further $\inf_{x} \hat g(x) > \sigma_g$.
The reason for the restrictions on $g(\cdot)$ will become clear below.

The restrictions (Ai)-(Aiv) are the only restrictions common to all functionals 
and models in the class. Additional model and/or functional specific restrictions 
will be given below. 

To motivate our interest in such a class of functionals and models we provide
a number of examples. In each case, one can use Lemma 3.4 to verify that the
influence function of $\psi(\theta)$ is as given. All but Examples 3 and 4 are examples of
(locally) nonparametric models.

Example 1. Suppose $O = (A, Y, X)$ with $A$ and $Y$ univariate random variables.

Example 1a (Expected product of conditional expectations). Let $\psi(\theta) = E_\theta[p(X)b(X)]$
where $b(X) = E_\theta[Y|X]$, $p(X) = E_\theta[A|X]$. In this model

$$IF_{1,\psi}(\theta) = p(X)b(X) - \psi(\theta) + p(X)\{Y - b(X)\} + b(X)\{A - p(X)\}$$

so $H_1 = -1$, $H_2 = A$, $H_3 = Y$, $H_4 = 0$.
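The double robustness of $H(b,p)$ in Example 1a can be illustrated by simulation. The sketch below assumes NumPy and a hypothetical data generating process chosen only for illustration (uniform $X$, $b(x) = \sin(2\pi x)$, $p(x) = 0.2 + 0.6x$). It checks that the sample mean of $H$ stays close to $E[p(X)b(X)]$ when only one of $b$, $p$ is misspecified, and that the bias when both are wrong matches $E[(B - B^*)(P - P^*)H_1]$ from Lemma 3.3.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.uniform(size=n)

def b(x):                      # b(X) = E[Y | X], hypothetical choice
    return np.sin(2 * np.pi * x)

def p(x):                      # p(X) = E[A | X], hypothetical choice
    return 0.2 + 0.6 * x

Y = b(X) + rng.normal(size=n)
A = rng.binomial(1, p(X))

def H(b_vals, p_vals):
    """H(b, p) = B P H1 + B H2 + P H3 + H4 with H1 = -1, H2 = A, H3 = Y, H4 = 0."""
    return -b_vals * p_vals + b_vals * A + p_vals * Y

b_wrong = np.full(n, 0.5)      # deliberately misspecified b*
p_wrong = np.full(n, 0.5)      # deliberately misspecified p*

psi = np.mean(b(X) * p(X))                     # Monte Carlo value of E[p(X) b(X)]
only_b_wrong = np.mean(H(b_wrong, p(X)))       # b misspecified, p correct
only_p_wrong = np.mean(H(b(X), p_wrong))       # p misspecified, b correct
both_wrong = np.mean(H(b_wrong, p_wrong))      # both misspecified
# Bias predicted by Lemma 3.3, part 2, with H1 = -1
predicted_bias = np.mean(-(b(X) - b_wrong) * (p(X) - p_wrong))

print(psi, only_b_wrong, only_p_wrong, both_wrong, psi + predicted_bias)
```

With one nuisance function wrong the estimates agree with `psi` up to Monte Carlo error, while `both_wrong` is visibly off and agrees with `psi + predicted_bias`.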

We also consider the special case of this model where $A = Y$ with probability
one (w.p.1). Then, as in assumption (Aiiib), $b(X) = p(X)$ w.p.1, $H_2 = H_3$ w.p.1.
Then $\psi(\theta) = E_\theta[b^2(X)]$. In Section 5, we show how our confidence interval for
$E_\theta[b^2(X)]$ can be used to obtain an adaptive confidence interval for the regression
function $b(\cdot)$.

Example 1b (Expected conditional covariance).

$$\psi(\theta) = E_\theta[AY] - E_\theta[p(X)b(X)] = E_\theta[\mathrm{cov}_\theta\{Y, A|X\}]$$

has influence function

$$AY - \{p(X)b(X) + p(X)\{Y - b(X)\} + b(X)\{A - p(X)\}\} - \psi(\theta),$$

so $H_1 = 1$, $H_2 = -A$, $H_3 = -Y$, $H_4 = AY$.

Example 1c below shows that a confidence interval and point estimators for
$E_\theta[\mathrm{cov}_\theta\{Y, A|X\}]$ can be used to obtain confidence intervals and point estimators
for the variance weighted average treatment effect in an observational study.

Example 1c (Variance-weighted average treatment effect). Suppose, in an observational
study, $O = \{Y^*, A, X\}$, $A$ is a binary treatment taking values in $\{0, 1\}$, $Y^*$
is a univariate response and $X$ is a vector of pretreatment covariates. Consider the
parameter $\tau(\theta)$ given by:

(3.4) $\tau(\theta) = \dfrac{E_\theta[\mathrm{cov}_\theta(Y^*, A|X)]}{E_\theta[\mathrm{var}_\theta(A|X)]} = \dfrac{E_\theta[\mathrm{cov}_\theta(Y^*, A|X)]}{E_\theta[\pi(X)\{1 - \pi(X)\}]}$,

where $\pi(X) = \mathrm{pr}(A = 1|X)$ is often referred to as the propensity score. We are
interested in $\tau(\theta)$ for several reasons. First, in the absence of confounding by
unmeasured factors, $\tau(\theta)$ is the variance-weighted average treatment effect since
$\tau(\theta)$ can be rewritten as $E_\theta[w_\theta(X)\gamma(X;\theta)]$ where $w_\theta(X) = \dfrac{\mathrm{var}_\theta(A|X)}{E_\theta[\mathrm{var}_\theta(A|X)]}$ and

$$\gamma(x;\theta) = E_\theta(Y^*|A = 1, X = x) - E_\theta(Y^*|A = 0, X = x)$$

is the average conditional treatment effect at level $x$ of the covariates. Second, under
the semiparametric model

(3.5) $\gamma(X;\theta) = \upsilon(\theta)$ w.p.1

that assumes the treatment effect does not depend on $X$, $\tau(\theta) = \upsilon(\theta)$. In Remark 4.2,
we briefly consider inference on $\tau(\theta)$ under model (3.5). However since
the model (3.5) may not hold and therefore the parameter $\upsilon(\theta)$ may be undefined,
our main goal is to make inference on $\tau(\theta)$ without imposing (3.5).

Now if for any $\tau \in R$, we define $\psi(\tau, \theta)$ to be

$$\psi(\tau, \theta) = E_\theta[\{Y^*(\tau) - E_\theta(Y^*(\tau)|X)\}\{A - E_\theta(A|X)\}],$$

with $Y^*(\tau) = Y^* - \tau A$, it is easy to verify that $\tau(\theta)$ may also be characterized
as the solution $\tau = \tau(\theta)$ to the equation $\psi(\tau, \theta) = 0$. Thus inference on $\tau(\theta)$
is easily obtained from inference on $\psi(\tau, \theta)$. In particular a $(1 - \alpha)$ confidence
set for $\tau(\theta)$ is the set of $\tau$ such that a $(1 - \alpha)$ confidence interval for $\psi(\tau, \theta)$ contains
0. Therefore, with no loss of generality, we consider the construction of a $(1 - \alpha)$ CI
for $\psi(\tau, \theta)$ for a fixed value of $\tau$, and write $Y = Y^*(\tau)$ and $\psi(\theta) = \psi(\tau, \theta)$. Thus
$\psi(\theta) = E_\theta[\mathrm{cov}_\theta\{Y, A|X\}]$ and we are in the setting of Example 1b.

In Section 6, we show the rates at which the width of the confidence sets for
$\psi(\tau, \theta)$ and for $\tau(\theta)$ shrink with $n$ are equal.
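The construction of a confidence set for $\tau(\theta)$ by inverting intervals for $\psi(\tau, \theta)$ can be sketched as a grid search. The code assumes NumPy; `toy_ci` and the numbers `c_hat`, `v_hat`, `half_width` are hypothetical stand-ins for whatever interval procedure is actually used, and rely only on the fact that $\psi(\tau, \theta) = E_\theta[\mathrm{cov}_\theta(Y^*, A|X)] - \tau E_\theta[\mathrm{var}_\theta(A|X)]$ is linear in $\tau$.

```python
import numpy as np

def confidence_set_by_inversion(tau_grid, ci_for_psi):
    """Keep every tau whose interval for psi(tau, theta) covers 0."""
    kept = []
    for tau in tau_grid:
        lo, hi = ci_for_psi(tau)
        if lo <= 0.0 <= hi:
            kept.append(tau)
    return np.array(kept)

# Hypothetical point estimates and interval half width, for illustration only.
c_hat = 0.30        # estimate of E[cov(Y*, A | X)]
v_hat = 0.20        # estimate of E[var(A | X)]
half_width = 0.05

def toy_ci(tau):
    center = c_hat - tau * v_hat    # psi(tau) is linear in tau
    return center - half_width, center + half_width

tau_grid = np.linspace(0.0, 3.0, 301)
tau_set = confidence_set_by_inversion(tau_grid, toy_ci)
print(tau_set.min(), tau_set.max())
```

In this toy setting the retained values of $\tau$ form an interval around `c_hat / v_hat`, with width equal to the width of the interval for $\psi$ divided by `v_hat`.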

Example 2a (Missing at random). Suppose $O = (AY, A, X)$ where $Y$ is an
outcome that is not always observed, $A$ is the binary missingness indicator, $X$ is
a $d$-dimensional vector of always observed continuous covariates, and let $b(X) =
E(Y|A = 1, X)$, $\pi(X) = P(A = 1|X)$ be the propensity score, and $p(X) = 1/\pi(X)$.
We suppose $\pi(X) > \sigma > 0$ and define

(3.6) $\psi(\theta) = E_\theta\left[\dfrac{AY}{\pi(X)}\right] = E_\theta[b(X)]$.

Interest in $\psi(\theta)$ lies in the fact that $\psi(\theta)$ is the marginal mean of $Y$ under
the missing (equivalently, coarsening) at random (MAR) assumption that $P(A =
1|X, Y) = \pi(X)$. In this model $IF_{1,\psi}(\theta) = Ap(X)(Y - b(X)) + b(X) - \psi(\theta)$ so
$H_1 = -A$, $H_2 = 1$, $H_3 = AY$, $H_4 = 0$.
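For the missing at random example, $H(b,p) = Ap(X)\{Y - b(X)\} + b(X)$ depends on the data only through the observed $AY$, $A$ and $X$. The simulation below is a sketch under a hypothetical data generating process (uniform $X$, $\pi(x) = 0.3 + 0.5x$, $b(x) = 1 + 2x$) and assumes NumPy. It shows that the sample mean of $H$ recovers the marginal mean of $Y$ when either $b$ or $p$ is replaced by a wrong function, but not when both are.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
X = rng.uniform(size=n)
pi = 0.3 + 0.5 * X          # propensity P(A = 1 | X), bounded away from 0
b = 1.0 + 2.0 * X           # b(X) = E[Y | A = 1, X], equal to E[Y | X] under MAR
Y = b + rng.normal(size=n)
A = rng.binomial(1, pi)
AY = A * Y                  # only the product A * Y is observed

def H(b_vals, p_vals):
    """H(b, p) = A p(X) (Y - b(X)) + b(X), written in terms of the observed A * Y."""
    return p_vals * (AY - A * b_vals) + b_vals

true_mean = 2.0                                       # E[b(X)] = 1 + 2 * E[X]
est_wrong_b = np.mean(H(np.zeros(n), 1.0 / pi))       # b misspecified as 0, p correct
est_wrong_p = np.mean(H(b, np.full(n, 2.0)))          # p misspecified as 2, b correct
est_both_wrong = np.mean(H(np.zeros(n), np.full(n, 2.0)))

print(est_wrong_b, est_wrong_p, est_both_wrong)
```

The first estimate is the familiar inverse probability weighted mean and the second is a regression estimate with a mean zero correction; only the third is biased.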

Note that if one has assumed a priori that $f_X(\cdot)$ and $p(X)$ lay in Hölder balls
with respective exponents $\beta_{f_X}$ and $\beta_p$, then $\beta_g$ would be $\min(\beta_{f_X}, \beta_p)$, since
$g(X) = -f_X(X)/p(X)$.

Example 2b (Missing not at random). Consider again the setting of Example 2a
but we no longer assume MAR. Rather we assume

$$P(A = 1|X, Y) = \{1 + \exp\{-[\gamma(X) + \alpha Y]\}\}^{-1}$$

may depend on $Y$, where now $\gamma(X)$ is an unknown function and $\alpha$ is a known
constant (to be later varied in a sensitivity analysis). In this case the marginal
mean of $Y$ is given by $\psi(\theta) = E_\theta(AY[1 + \exp\{-[\gamma(X) + \alpha Y]\}])$. Robins and
Rotnitzky [17] proved this model places no restrictions on $F(o)$ and derived

$$IF_{1,\psi}(\theta) = A\{1 + \exp\{-\alpha Y\}p(X)\}\{Y - b(X)\} + b(X) - \psi(\theta)$$

where, now,

$$b(X) = E[Y\exp\{-\alpha Y\}|A = 1, X] / E[\exp\{-\alpha Y\}|A = 1, X],$$

and $p(X) = \exp\{-\gamma(X)\}$. Thus

$$H_1 = -\exp\{-\alpha Y\}A,\quad H_2 = \{1 - A\},\quad H_3 = AY\exp\{-\alpha Y\},$$

and $H_4 = AY$. When $\alpha = 0$ this provides an alternate parametrization of Example 2a.

Example 3 (Marginal structural models and the average treatment effect). Consider
the set-up of Example 1c including the non-identifiable assumption of no
unmeasured confounders, except now $A$ is discrete with possibly many levels and
$f(a|X) > \delta > 0$ w.p.1. A marginal structural model assumes $E_{F_X}\{E_\theta(Y^*|
A = a, X)\} = d(a, \upsilon(\theta))$, where $d(a, \upsilon)$ is a known function and $\upsilon(\theta)$ is an unknown
vector parameter of dimension $d^*$. When $A$ is dichotomous with $a \in \{0, 1\}$
and $d(a, \upsilon) = \upsilon_1 + \upsilon_2 a$, then $\upsilon_2(\theta)$ is the average treatment effect parameter. Let
$f^*(a)$ be any density with the same support as $A$ and let $s^*(a)$ be a $d^*$-vector function,
both chosen by the analyst. Then $\upsilon(\theta)$ is identified as the (assumed) unique
value of $\upsilon$ satisfying

$$\psi_\upsilon(\theta) \equiv E_\theta\left[s(O, A, \upsilon)\frac{f^*(A)}{f(A|X)}\right] = 0,$$

where $s(O, a, \upsilon) = \{Y^* - d(a, \upsilon)\}s^*(a)$. Thus a $(1 - \alpha)$ confidence set for $\upsilon(\theta)$ is
the set of vectors $\upsilon$ such that a $(1 - \alpha)$ CI for $\psi_\upsilon(\theta)$ contains 0. Therefore, with
no loss of generality, we consider the construction of a $(1 - \alpha)$ CI for the $d^*$-vector
functional $\psi(\theta) = \psi_\upsilon(\theta)$ for a fixed value $\upsilon$ and define $h(O, A) = s(O, A, \upsilon)$ and
$b(a, X) = E_\theta[h(O, a)|A = a, X]$. Then $\psi(\theta)$ has influence function

$$IF_1(\theta) = \frac{f^*(A)}{f(A|X)}\{h(O, A) - b(A, X)\} + \int b(a, X)\,dF^*(a) - \psi(\theta).$$

Next define $p(a, X) = 1/f(a|X)$ and $\psi(\theta, a) = E_{F_X}[b(a, X)]$. Then $IF_1(\theta)$ is the
integral

$$IF_1(\theta) = \int dF^*(a)\, IF_1(a, \theta),$$
$$IF_1(a, \theta) = H_1(a)p(a, X)b(a, X) + H_2(a)b(a, X) + H_3(a)p(a, X) - \psi(\theta, a),$$
$$H_1(a) = -I(A = a),\quad H_2(a) = 1,\quad H_3(a) = I(A = a)h(O, a).$$

It follows that $IF_1(\theta)$ is an integral over $a \in \mathcal{A}$ of influence functions $IF_1(a, \theta)$
for parameters $\psi(\theta, a)$ in our class with $H_4(a) = 0$. Thus we can estimate $\psi(\theta)$
by $\int dF^*(a)\,\hat\psi(a)$, where $\hat\psi(a)$ is an estimator of $\psi(\theta, a)$. If the support of $A$ is
of greater cardinality than $d^*$, the model is not locally nonparametric. Different
choices for $s^*(a)$ and $f^*(a)$ for which $\{\partial/\partial\upsilon^T\}E_\theta\left[s(O, A, \upsilon)\frac{f^*(A)}{f(A|X)}\right]$ is invertible
may result in different influence functions. All yield the same rate of convergence,
although the constants differ. See Remark 2.5 above. Extension of our methods to
continuous $A$ will be treated elsewhere.

Example 4 (Confidence intervals for the optimal treatment strategy in a randomized
clinical trial). Consider a randomized clinical trial with data $O = \{Y, Y^*, A, X\}$,
where $A$ is a binary treatment taking values in $\{0, 1\}$, $Y^*$ and $Y$ are univariate responses, and
$X$ is a vector of pretreatment covariates. In a randomized trial, the randomization
probabilities $\pi_0(X) = P(A = 1|X)$ are known by design. Let $b(x) = E_\theta(Y^*|A =
1, X = x) - E_\theta(Y^*|A = 0, X = x)$ and $p(x) = E_\theta(Y|A = 1, X = x) - E_\theta(Y|A =
0, X = x)$ be the average treatment effects at level $X = x$ on $Y^*$ and $Y$. We assume
$Y$ and $Y^*$ have been coded so that positive treatment effects are desirable. Let
$\psi(\theta) = E[b(X)p(X)]$. Because the model is not locally nonparametric there exists
more than a single first order influence function. Indeed, for any given function $c(\cdot)$,

$$IF_1(\theta, c) = b(X)p(X) - \psi(\theta) + [b(X)\{Y - Ap(X)\} + p(X)\{Y^* - Ab(X)\}]$$
$$\times \{A - \pi_0(X)\}\sigma_0^{-2}(X) + c(X)\{A - \pi_0(X)\},$$

with $\sigma_0^2(X) = \pi_0(X)\{1 - \pi_0(X)\}$, is an influence function in our class [provided it
is square integrable] with

$$H_1 = 1 - 2A\{A - \pi_0(X)\}\sigma_0^{-2}(X),$$
$$H_2 = \{A - \pi_0(X)\}\sigma_0^{-2}(X)Y,$$
$$H_3 = \{A - \pi_0(X)\}\sigma_0^{-2}(X)Y^*,$$
$$H_4 = c(X)\{A - \pi_0(X)\}.$$

As $c(\cdot)$ is varied, one obtains all first order influence functions. We do not discuss
the efficient choice of $c(\cdot)$ in this paper.

Our interest lies in the special case where $Y = Y^*$ w.p.1 (so there is but one
response of interest) and thus, as in assumption Aiiib), $b = p$, $H_2 = H_3$ and we
construct confidence intervals for $\psi(\theta) = E[b^2(X)]$. In Section 5 we describe how we
can use a confidence interval for $\psi(\theta) = E[b^2(X)]$ to obtain confidence intervals for
the treatment effect function $b(x)$ and, most importantly, for the optimal treatment
strategy $d_{opt}(x) = I[b(x) > 0]$ under which a subject with covariate value $x$ is
treated if and only if the treatment effect $b(x)$ is positive (i.e., $d_{opt}(x) = 1$).

3.2. Higher order influence functions for our model 

3.2.1. Dirac kernels, truncation bias, and a truncated parameter 

In all of our examples the functions $p(\cdot)$ and $b(\cdot)$ are functions of conditional
expectations given the continuous random variable $X$. It is well known that the
associated point-evaluation functionals $p(x)$ and $b(x)$ do not have first order influence
functions. It then follows from part 5c of Theorem 2.3 and the dependence of
$\mathbb{IF}_{1,\psi}(\theta) = \mathbb{V}\left[if_{1,\psi}(O_{i_1};\theta)\right]$ on $b(\cdot)$ and $p(\cdot)$ evaluated at the point $X$ that, in none
of our examples, does $\psi(\theta)$ have a second (or higher) order influence function.

As a precise understanding of the reason for the nonexistence of higher order
influence functions for $\psi(\theta)$ is fundamental to our approach, we now use part 5c of
Theorem 2.3 to prove that $\mathbb{IF}_{2,\psi}(\theta)$ does not exist by showing that the functional
$if_{1,\psi}(o;\theta)$ does not have a first order influence function $\mathbb{V}\left[if_{1,if_{1,\psi}(o;\cdot)}(O;\theta)\right]$. Let
$F_X$ and $f_X = f_X(\cdot)$ denote the marginal CDF and density of $X$. In this proof, we
do not assume that $p(\cdot)$ and $b(\cdot)$ are functions of conditional expectations. Rather
we only assume that our functional satisfies assumptions Ai)-Aiv).

Consider paths (parametric submodels) $\theta_l(t)$ such that $\theta_l(0) = \theta$ satisfying

$$p_l(t) = p_l(x, t) = p(x) + t c_l(x),$$
$$b_l(t) = b_l(x, t) = b(x) + t a_l(x),$$

where the sequences $c_l(\cdot)$ and $a_l(\cdot)$, $l = 1, 2, \ldots$, are each dense in $L_2[F_X(x)]$. Let

$$s_l(O;\theta) = s_l(O|X;\theta) + s_l(X;\theta),$$

$s_l(O|X;\theta)$, and $s_l(X;\theta)$ denote the overall, conditional, and marginal scores

$$d\ln f(O;\theta_l(0))/dt,\quad d\ln f(O|X;\theta_l(0))/dt,\quad d\ln f_X(X;\theta_l(0))/dt.$$

By linearity, $if_{1,\psi}(o;\theta)$ has an influence function only if the functionals $b(x)$ and
$p(x)$ have one as well. Now by differentiating the identity

$$E_{\theta_l(t)}[\{H_1 b_l(X, t) + H_3\}|X = x] = 0$$

with respect to $t$ and evaluating at $t = 0$, we have

$$-E_\theta[\{H_1 b(X) + H_3\}s_l(O|X)|X = x] = E_\theta[H_1|X = x]\,a_l(x).$$

However, by definition, $b(x)$ has an influence function $\mathbb{V}\left[if_{1,b(x)}(O;\theta)\right]$ at $\theta$ only if
for $l = 1, 2, \ldots$, both $db_l(x, t)/dt|_{t=0} = a_l(x)$ equals $E_\theta\left[if_{1,b(x)}(O;\theta)s_l(O;\theta)\right]$
and $E_\theta\left[if_{1,b(x)}(O;\theta)\right] = 0$. Thus if $if_{1,b(x)}(O;\theta)$ exists, it must satisfy

$$-E_\theta[\{H_1 b(X) + H_3\}s_l(O|X)|X = x] = E_\theta[H_1|X = x]\,E_\theta\left[if_{1,b(x)}(O;\theta)s_l(O;\theta)\right].$$

Without loss of generality, suppose $H_1 \geq 0$ w.p.1. Now if we could find a 'kernel'
$K_{f_X,\infty}(x, X)$ such that

(3.7) $r(x) = E_{f_X}\left[K_{f_X,\infty}(x, X)r(X)\right] \equiv \int K_{f_X,\infty}(x, x^*)r(x^*)f_X(x^*)\,dx^*$ for all $r(\cdot) \in L_2(F_X)$

then

$$if_{1,b(x)}(O;\theta) = -\{E_\theta[H_1|X = x]\}^{-1/2}K_{f_X,\infty}(x, X)\{E_\theta[H_1|X]\}^{-1/2}\{H_1 b(X) + H_3\}$$

would be an influence function since

$$E_\theta[H_1|X = x]\,E_\theta\left[if_{1,b(x)}(O;\theta)s_l(O;\theta)\right]$$
$$= E_\theta[H_1|X = x]^{1/2}E_\theta\left[-K_{f_X,\infty}(x, X)\{E_\theta[H_1|X]\}^{-1/2}\{H_1 b(X) + H_3\}\{s_l(O|X) + s_l(X)\}\right]$$
$$= E_\theta[H_1|X = x]^{1/2}E_{f_X}\left[-K_{f_X,\infty}(x, X)\{E_\theta[H_1|X]\}^{-1/2}E_\theta[\{H_1 b(X) + H_3\}s_l(O|X)|X]\right]$$
$$= -E_\theta[(H_1 b(X) + H_3)s_l(O|X)|X = x].$$

By an analogous argument

$$if_{1,p(x)}(O;\theta) = -\{E_\theta[H_1|X = x]\}^{-1/2}K_{f_X,\infty}(x, X)\{E_\theta[H_1|X]\}^{-1/2}\{H_1 p(X) + H_2\}$$

would be an influence function.

Indeed since the sequences $\{c_l(\cdot)\}$ and $\{a_l(\cdot)\}$ are dense the existence of such
a kernel is also a necessary condition for $if_{1,b(x)}(O;\theta)$ and $if_{1,p(x)}(O;\theta)$ to exist
and thus for $if_{1,\psi}(o;\theta)$ to have an influence function. A kernel satisfying Equation (3.7) is referred to
as the Dirac delta function with respect to the measure $dF_X(x)$ and would clearly
have to satisfy

(3.8) $K_{f_X,\infty}(x_{i_1}, x_{i_2}) = 0$ if $x_{i_2} \neq x_{i_1}$

were it to exist. Of course a kernel satisfying Equation (3.7) is known not to exist in
$L_2[F_X] \times L_2[F_X]$. We conclude that $if_{1,\psi}(o;\theta)$ does not have an influence function
and therefore $\mathbb{IF}_{2,2,\psi}(\theta)$ does not exist.



A formal approach 

To motivate how one might overcome this difficulty, we note that kernels satisfying
Equation (3.7) exist as generalized functions or kernels (also known as Schwartz
functions or distributions). We shall 'formally' derive higher order influence functions
that appear to be elements of the space of generalized functions. However,
we use these calculations only as motivation for statistical procedures based on ordinary
kernels living in $L_2[F_X] \times L_2[F_X]$. Thus it does not matter whether these
formal calculations could be made rigorous with appropriate redefinitions. Rather
we can simply regard the following as results obtained by applying a "formal calculus"
to part 5c of Theorem 2.3 that adds to the usual calculus additional identities
licensed by Equations (3.7) and (3.8).

We will need the fact that, for any function $v(x;\theta)$, Eq. (3.8) implies that

$$v(x;\theta)K_{f_X,\infty}(x, X) = v(X;\theta)K_{f_X,\infty}(x, X).$$
We now show that 

IF2,2.^ [9) ^W[IF2^2.,i,.n., m = He ^2 [V [*/i,,/^_^(o,^,) (O.,;0) /2] [U^''' (0) 
would formally have U-statistic kernel 



(3.9) IF2^2,iKiiA2 



[0) 



£b,ii {d) Eg [Hi\Xi-^] 2 Kf^ oo i^iiiXi^) 



Ee[Hi[X 



(0) 



with £b,,, {6) = + Hs,^,}, Ep^^, (9) = {Hi,,,P,, + Fa^^J ■ 

To show Equation (3.9) note, by

$$\partial H(b,p)/\partial P = \partial\{BPH_1 + BH_2 + PH_3 + H_4\}/\partial P = BH_1 + H_3$$

and

$$\partial H(b,p)/\partial B = PH_1 + H_2,$$

we have

*/l.A,.(0., ;.) (O..; ^) = Q2AI. (^) + Q2,p.l, (^) - (0) , 

where 

^^{B,,Hi,,,+H-i,,,]Ee[Hi\X,,]-"^ 

X Kf^^^ {X,„X,,)Eg[Hi\X,,]-^ {P,,Hl,,,+H2.^A 
= -~Sb.^^ {0) Eg [Hi\X,,]~^ Kf^^^ {X,„ X,,) Eg [Hi\X,,]~^ £p,,, (9) 
Q2Ah (^) ^ {P^.Hl.^,+H2,^A^A,b{X,^) {O^.\0) 

= -£M. {0) Eg Kf^^^ {X,„X,,)Eg [H^[X,,]-"- £p,,, {9) . 

Thus, by part 5(c) of Theorem 2.3, 



IF2,2,^ [0) = n^, 



\ {<32,pl. i0) + ^2,61. {0) + IIFi.v...} [Ut'-" [9) 



= \{%,r>J^0)+%,bJ^0)} 

= Q2 p [9) = V [RHS of Equation (3.9)] 

since IFi,^^i2 is a function of only one subject's data and Q2pi2 ^2 bi2 (^) 

are the same up to a permutation that exchanges i2 with ii. 

To obtain $\mathbb{IF}_{3,3,\psi}(\theta)$, one must derive the influence function
$if_{1,if_{2,2,\psi}(O_{i_1},O_{i_2};\cdot)}(O_{i_3};\theta)$ of $if_{2,2,\psi}(O_{i_1}, O_{i_2};\theta)$. The formula for $\mathbb{IF}_{m,m,\psi}(\theta)$ is
given in Equation (3.13). A detailed derivation is given in our technical report.
Here we simply note that the only essentially new point is that we now require the
influence function of $K_{f_X,\infty}(X_{i_1}, X_{i_2})$, which, as shown next, is given by

To see that if Equation (3.7) held. Equation (3.10) would hold, note that for any 

. Differentiating 



with respect to t and evaluating at t = , we have 



path (t) with 9 (0) ^ fx{-), h {x) = £~^^^ 



Q = Ee [A7^,oo [x, X) h (X) S (0)] + E„ 

Hence it suffices to show that 

-Se[if/x,oo {,x,X)h{X)S{0)] 

= [{Eei^Kf,^^ {x,X,,)Kf^,,oo {X,„X,JS,, {0) \X,,}}h{X,J] 

But, by Equation (3.7), 

E, [{Ee {-i^/,,oo {x, X.,,)Kf,,^ (X,„ 5., {0) h (X,J] 

= E,[-Kf,^^ {x,X,,)S,, {0)h{X,,)]. 



Feasible estimators 

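These "formal" calculations motivate a "truncated Dirac" approach to estimate
$\psi(\theta)$. Let $\{z_l(\cdot)\} = \{z_l(X);\ l = 1, 2, \ldots\}$ be a countable sequence of known basis functions
with dense span in $L_2(F_X)$ and define $\bar z_k(X)^T = (z_1(X), \ldots, z_k(X))$. Define

$$K_{f_X,k}(X_{i_1}, X_{i_2}) \equiv \bar z_k(X_{i_1})^T\left\{E_{f_X}\left[\bar z_k(X)\bar z_k(X)^T\right]\right\}^{-1}\bar z_k(X_{i_2})$$

to be the projection kernel in $L_2(F_X)$ onto the subspace

$$\mathrm{lin}\{\bar z_k(X)\} \equiv \left\{\eta^T\bar z_k(x);\ \eta \in R^k,\ \eta^T\bar z_k(x) \in L_2(F_X)\right\}$$

spanned by the elements of $\bar z_k(X)$. That is, for any $h(x)$,

$$\Pi_{f_X}\left[h(X)\,|\,\mathrm{lin}\{\bar z_k(X)\}\right] = E_{f_X}\left[K_{f_X,k}(x, X)h(X)\right]$$
$$= \bar z_k(x)^T\left\{E_{f_X}\left[\bar z_k(X)\bar z_k(X)^T\right]\right\}^{-1}E_{f_X}\left[\bar z_k(X)h(X)\right].$$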
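The reproducing property of the projection kernel on $\mathrm{lin}\{\bar z_k(X)\}$ can be checked numerically. The sketch below assumes NumPy and makes two simplifying, hypothetical choices that are not from the paper: $X$ is taken to be uniform on a finite grid so that expectations are exact averages, and $\bar z_k$ is a polynomial basis.

```python
import numpy as np

def basis(x, k):
    """Polynomial basis z_k(x) = (1, x, ..., x^{k-1}); a hypothetical choice."""
    return np.vander(np.asarray(x, dtype=float), N=k, increasing=True)

def projection_kernel(x1, x2, support, k):
    """K_k(x1, x2) = z_k(x1)^T {E[z_k(X) z_k(X)^T]}^{-1} z_k(x2), X uniform on support."""
    Z = basis(support, k)
    gram = Z.T @ Z / len(support)               # E[z_k(X) z_k(X)^T]
    return basis(x1, k) @ np.linalg.solve(gram, basis(x2, k).T)

support = np.linspace(0.0, 1.0, 201)            # grid on which X is uniform
k = 4
K = projection_kernel(support, support, support, k)

r_in_span = 1.0 - 2.0 * support + 0.5 * support ** 3   # lies in lin{z_k}
r_outside = np.exp(support)                             # does not lie in lin{z_k}

proj_in = K @ r_in_span / len(support)          # E[K_k(x, X) r(X)] at each grid point x
proj_out = K @ r_outside / len(support)
err_in = np.max(np.abs(proj_in - r_in_span))
err_out = np.max(np.abs(proj_out - r_outside))
print(err_in, err_out)
```

Functions in the span are reproduced up to rounding error, while a function outside the span is only approximated; the gap for such functions is the source of the truncation bias discussed later.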

Then we can view $K_{f_X,k}(x_{i_1}, X_{i_2})$ as a truncated at $k$ approximation to $K_{f_X,\infty}(x_{i_1},
X_{i_2})$ that is in $L_2[F_X] \times L_2[F_X]$ and satisfies Equation (3.7) for all $r(x) \in
\mathrm{lin}\{\bar z_k(X)\}$. Then a natural idea would be to substitute



IF, 



[k] 



2.2,ip,ii ,12 



xE^[Hi\X,, 



-■p,l2 



with, for example, 



et,,, (0) = E^[Hi\X,,] * +i?3,»i} 
for the generalized function IF2.2,if!,ii.i2 (^) based on Equation (3.9) resulting in 






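the feasible second order U-statistic estimator

$$\hat\psi_2^{(k)} = \psi(\hat\theta) + \mathbb{IF}_{1,\psi}(\hat\theta) + \mathbb{IF}^{(k)}_{2,2,\psi}(\hat\theta)$$

where

$$\mathbb{IF}^{(k)}_{2,2,\psi}(\hat\theta) = \mathbb{V}\left[IF^{(k)}_{2,2,\psi,i_1,i_2}(\hat\theta)\right].$$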

To avoid having to do a matrix inversion it would be convenient, when possible, to choose $\bar z_k(X) = \bar\varphi_k(X)/\{\hat f_X(X)\}^{1/2}$ where $\varphi_1(X), \varphi_2(X), \ldots$ is a complete orthonormal basis with respect to Lebesgue measure in $R^d$. Then $E_{\hat\theta}\left[\bar z_k(X)\bar z_k(X)^T\right] = I_{k\times k}$ so

$$K_{\hat f_X,k}(X_{i_1},X_{i_2}) = \frac{K_{Leb,k}(X_{i_1},X_{i_2})}{\{\hat f_X(X_{i_1})\hat f_X(X_{i_2})\}^{1/2}}$$

where

$$K_{Leb,k}(X_{i_1},X_{i_2}) = \bar\varphi_k(X_{i_1})^T\bar\varphi_k(X_{i_2}).$$

This choice corresponds to having taken

$$K_{f_X,\infty}(X_{i_1},X_{i_2}) = K_{Leb,\infty}(X_{i_1},X_{i_2})/\{f_X(X_{i_1})f_X(X_{i_2})\}^{1/2}$$

in our formal calculations where $K_{Leb,\infty}(X_{i_1},X_{i_2})$ is the Dirac delta function with respect to Lebesgue measure. In that case with $G = g(X) = f_X(X)E_\theta[H_1|X]$ and $\hat G = \hat g(X) = \hat f_X(X)E_{\hat\theta}[H_1|X]$,



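(3.11) $IF_{2,2,\psi,i_1,i_2}(\theta) = -\varepsilon_{b,i_1}(\theta)\,g(X_{i_1})^{-1/2}K_{Leb,\infty}(X_{i_1},X_{i_2})\,g(X_{i_2})^{-1/2}\varepsilon_{p,i_2}(\theta)$,

(3.12) $IF^{(k)}_{2,2,\psi,i_1,i_2}(\hat\theta) = -\hat\varepsilon_{b,i_1}\,\hat g(X_{i_1})^{-1/2}K_{Leb,k}(X_{i_1},X_{i_2})\,\hat g(X_{i_2})^{-1/2}\hat\varepsilon_{p,i_2}$.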
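The truncated kernel idea can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes X is uniform on [0, 1] (so the density is 1 and the projection kernel reduces to the Lebesgue kernel), uses a cosine basis as a stand-in orthonormal basis, and takes a hypothetical test function `h`. It shows that the integral of the truncated kernel against `h` approaches `h(x)` as k grows, which is the sense in which the truncated kernel approximates the Dirac delta.

```python
import math

def phi(l, x):
    """l-th cosine basis function on [0, 1], orthonormal w.r.t. Lebesgue measure."""
    return 1.0 if l == 1 else math.sqrt(2.0) * math.cos((l - 1) * math.pi * x)

def k_leb(k, x1, x2):
    """Truncated kernel: sum over l <= k of phi_l(x1) * phi_l(x2)."""
    return sum(phi(l, x1) * phi(l, x2) for l in range(1, k + 1))

def project(h, k, x, grid=2000):
    """Approximate E[K_k(x, X) h(X)] for X ~ Uniform[0, 1] by the midpoint rule."""
    step = 1.0 / grid
    total = 0.0
    for i in range(grid):
        u = (i + 0.5) * step
        total += k_leb(k, x, u) * h(u)
    return total * step

def h(x):
    """Hypothetical smooth test function (an assumption of this sketch)."""
    return x * (1.0 - x)

# Approximation error at x = 0.3 for increasing truncation index k.
errors = {k: abs(project(h, k, 0.3) - h(0.3)) for k in (1, 4, 16)}
```

The error shrinks as k increases, at a speed governed by the smoothness of `h`, which mirrors the role of the Hölder exponents in the truncation bias.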



Unfortunately, we show later that this choice for $\bar z_k$ has good statistical properties only when $f_X$ is known to lie in a Hölder ball with exponent exceeding $\max\{\beta_b, \beta_p\}$. In our technical report we show one can proceed by induction to formally obtain that for $m = 3, 4, \ldots$,



IK 



(0) 



(3.13) 



Em — 2 / - \ 

,=0 c(m,j)x 



n 



X Kfx ,oo + 1 J Xi^ 



^iX,J-^ep,,^ (9) 



where c, (X) = E[Hi\X] and c(m,j) = (""2) (-1)^^'+^^ , which we then use 



to 



obtain IF 



(k) 



7n,m.ip.in 






Statistical properties 



We shall prove below that the estimator $\hat\psi_m^{(k)}$ has variance

$$\mathrm{var}_\theta\left[\hat\psi_m^{(k)}\right] \asymp \frac{1}{n}\max\left\{1, \left(\frac{k}{n}\right)^{m-1}\right\}$$

when $\{z_l(X); l = 1, 2, \ldots\}$ is a compact wavelet basis. (Robins et al. [18] prove this result for more general bases.) We also prove that the bias

$$E_\theta\left[\hat\psi_m^{(k)}\right] - \psi(\theta) = TB_k(\theta) + EB_m(\theta)$$

of $\hat\psi_m^{(k)}$ is the sum of a truncation bias term of order

$$TB_k(\theta) = O_p\left(k^{-(\beta_b+\beta_p)/d}\right)$$

(for a basis $\{z_l(X); l = 1, 2, \ldots\}$ that provides optimal rate approximation for Hölder balls) and an estimation bias term of order

$$EB_m(\theta) = O_p\left(\left[\hat P - P\right]\left[\hat B - B\right]\left[\frac{\hat G - G}{\hat G}\right]^{m-1}\right).$$

Note this estimation bias is

$$O_p\left(n^{-\left(\frac{(m-1)\beta_g}{2\beta_g+d} + \frac{\beta_b}{2\beta_b+d} + \frac{\beta_p}{2\beta_p+d}\right)}\right).$$

It gets its name from the fact that, unlike the truncation bias, it would be exactly zero if the initial estimator $\hat\theta$ happened to equal $\theta$. Thus, the U-statistic estimator $\hat\psi_m^{(k)}$ for our functional $\psi(\theta)$ (which does not admit a second order influence function) differs from the U-statistic estimators $\hat\psi_m$ of Eq. (2.3) for functionals that admit second order influence functions in that, owing to truncation bias, the total bias of $\hat\psi_m^{(k)}$ is not $O_p\left(\|\hat\theta - \theta\|^{m+1}\right)$. The choice of $k$ determines the trade-off between the variance and truncation bias. As $k \to \infty$ with $n$ fixed, $\mathrm{var}_\theta[\hat\psi_m^{(k)}] \to \infty$ and $TB_k(\theta) \to 0$. Thus, we can heuristically view the non-existent estimator $\hat\psi_m$ as the choice of $k$ that results in no truncation bias (and therefore a total bias of $O_p\left(\|\hat\theta - \theta\|^{m+1}\right)$) at the expense of an infinite variance. Writing $k = k(n) = n^\rho$, the order of the asymptotic MSE of $\hat\psi_m^{(k)}$ is minimized at the value of $\rho$ for which the order of the variance equals the order of the square of the sum of the truncation and estimation bias.
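The variance and bias trade-off above can be sketched numerically. The snippet below is illustrative only: it treats the stated $O_p$ orders as exact powers of $n$, writes $k = n^\rho$, and searches a grid of $\rho$ values for the smallest MSE exponent. The function names and the grid search are assumptions of this sketch; the smoothness values are the ones used in the concrete example later in this section ($\beta_b/d = \beta_p/d = 0.3$, $\beta_g/d = 0.1$).

```python
def mse_exponent(rho, m, beta_b, beta_p, beta_g, d):
    """Exponent e with MSE ~ n**e when k = n**rho (orders treated as exact powers)."""
    # Variance order: (1/n) * max{1, (k/n)^(m-1)}
    variance = -1.0 + max(0.0, (rho - 1.0) * (m - 1))
    # Squared truncation bias order: k^(-2 (beta_b + beta_p) / d)
    sq_truncation = -2.0 * rho * (beta_b + beta_p) / d
    # Squared estimation bias order
    sq_estimation = -2.0 * ((m - 1) * beta_g / (2 * beta_g + d)
                            + beta_b / (2 * beta_b + d)
                            + beta_p / (2 * beta_p + d))
    return max(variance, sq_truncation, sq_estimation)

def best_rho(m, beta_b, beta_p, beta_g, d, steps=200):
    """Grid search over rho in [0, 2] for the smallest MSE exponent."""
    grid = [2.0 * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda rho: mse_exponent(rho, m, beta_b, beta_p, beta_g, d))

rho3 = best_rho(3, 0.3, 0.3, 0.1, 1.0)
exp3 = mse_exponent(rho3, 3, 0.3, 0.3, 0.1, 1.0)   # MSE exponent for m = 3
exp2 = mse_exponent(best_rho(2, 0.3, 0.3, 0.1, 1.0), 2, 0.3, 0.3, 0.1, 1.0)
```

Under these assumptions the third order estimator attains an MSE exponent of -1 (root mean squared error of order $n^{-1/2}$), while the second order estimator is limited by its estimation bias.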

Remark 3.6. The models of Examples 1-4 exhibit a spectrum of different likelihood functions and therefore a spectrum of different first order and higher order scores. Nonetheless, because the first order influence functions of the functionals $\psi(\theta)$ share a common structure, we were able to use part 5c of Theorem 2.3 to formally derive $\mathbb{IF}_{m,m,\psi}(\theta)$ and, thus, the feasible $\mathbb{IF}^{(k)}_{m,m,\psi}(\hat\theta)$ in Examples 1-4 in a unified manner without needing to consult the full likelihood function for any of the models. See Remark (2.5) above for a closely related discussion.

A critical non-uniqueness



We have as yet neglected a critical non-uniqueness in our definition of $\mathbb{IF}^{(k)}_{m,m,\psi}$ and thus $\hat\psi_m^{(k)}$ that poses a significant problem for our "truncated Dirac" approach. For instance, when $m = 3$, the two generalized U-statistic kernels $IF_{3,3,\psi,i_1,i_2,i_3}(\theta)$ of Equation (3.13) and

£pM (^) 



IF. 



3, 3, 1/;, 11,42, 43 



are precisely equal, by Eq. (3.8); nonetheless, upon truncation, they result in dif- 
ferent feasible kernels; 



(fc) 



3,3,'0,«i ,^2 



r Hi 



and 



IF, 



(fc),* 



^^ (0) 



H 



?P,.3 (^) 



?(^.3)- 



with possibly different orders of bias. For simplicity, we consider the case where 
i/i = 1 as in Examples Ih. Let SB = B - B, SP = P - P, Sf = 5g = L- 1, a.nd 

1/2 



Zk ^ Zk (X) ^ {e (X)7p, [Xf] } (X) , 



then, 



E, 



IF, 



(fc),* 



3, 3, 1/), 11,42, 43 



= Eo 



SB,, X 



= Eh 



[sP,,ZkiX,,f]zk{X,,) 
\ f t( 

Eh 



-1 + 1 SBi, X 



- 1 2fc (^4: 



Eh 



V/(^n) 

oJ[b-b]{p-p]{g-g]' 



Zk {Xi,) 



Xi, 



T 

^Pi3 ^k,i3 



Zk {Xi, ) 

^fc.41 
and



That is, 



IF, 



(fc) 



3,3,i/Nii,*2,*3 



Eff 



Eh 



(Sf {X^-_ 
xZk.t 



zl,^E^[iSfiX,,) + l)SP,,Zk,.,] 



Eh 



— T 



z. - r^''-) 1 



+ Op ('{s - {p - p} {g - g}' 



/P 



3, 3, V', 21,22,*; 



-Eff 



IF, 



(fc) 



'5/(^^2)^., 

■T 



3, 3, 1/;, 11,12,43 



T 



Zk,\ 



xzr,„p^ 



ZkM^f {^12) 

'm,zu.,,] 

+ Op iy[B - P} {P - P} {g - g}' 

= [<5Pn^ [<5P|Zfc] [8f {X) \Zk\ 
- E^ [n^ [^p|Zfe] 5/ (X) [5P|Zfe] " 

+ Op (^{p - s} - ^} {g - g}' 

%[<5P|Zfe]x 

ni[^p|z,]n^[<5/(x)|^] 
-n^[jp|Zfc]ni[5/(x)|z,] 

+ Op (^{P - P} {P - P} {g - G^ 

where $\hat\Pi[h(X)|\bar Z_k]$ and $\hat\Pi^{\perp}[h(X)|\bar Z_k]$ respectively denote the projection under $\hat F$ in $L_2(\hat F)$ of $h(X)$ on the $k$ dimensional linear subspace $\mathrm{lin}\{\bar z_k(X)\}$ spanned by the components of the vector $\bar z_k(X)$ and the projection on the orthocomplement of this subspace. Since the basis $\{\varphi_l(X); l = 1, 2, \ldots\}$ provides optimal rate approximation for Hölder balls, it is easy to verify that the difference is of order



= E- 



o„ 



0p/d i_)g/d fip/d fj^Jd 

l + 23p/d l + 23g/d ^-/3b/(i _|_ ^ l + 2/3p/d l + 2/3j/d ^-^fg/d 

lip/d gfc/d 2/3g/d 

l + 2/3p/d 1 + 20^/d 1 + 23773 
which we expect to be sharp for many bases, although not for Haar. For concreteness, we shall look at an example. Suppose $\beta_b/d = \beta_p/d = 0.3$ and $\beta_g/d = 0.1$; then, for a suitable choice of $k$, $\hat\psi_3^{(k)}$ converges to $\psi(\theta)$ at rate $n^{-1/2}$. In contrast, the
order,



min n t+W^ ttw^ k~ '^^ ' + n t+W^ i+w^/d j^-Pg/d ^ J - max 1, — 
fc \ \i n \ 

of the optimal root mean squared error of $\hat\psi_3^{(k),*}$ that uses $\mathbb{IF}^{(k),*}_{3,3,\psi}(\hat\theta)$ is $n^{-0.477} \gg n^{-0.5}$. Thus, for many orthonormal bases, $\hat\psi_3^{(k),*}$ converges to $\psi(\theta)$ at a slower rate than $\hat\psi_3^{(k)}$ which uses $\mathbb{IF}^{(k)}_{3,3,\psi}(\hat\theta)$. Nothing in our development up to this point provides any guidance as to which of the many equivalent generalized U-statistic kernels should be selected for truncation. To provide some guidance, we introduce an alternative approach to the estimation of $\psi(\theta)$ based on truncated parameters that admit higher order influence functions. The class of estimators we derive using this alternative approach includes members algebraically identical to the estimators $\hat\psi_m^{(k)}$ but does not include estimators equivalent to less efficient estimators such as $\hat\psi_3^{(k),*}$.



An approach based on truncated parameters 

We introduce a class of truncated parameters $\tilde\psi_k(\theta)$ that (i) depend on the sample size through a positive integer index $k = k(n)$ (which we refer to as the truncation index and will be optimized below), (ii) have influence functions $\mathbb{IF}_{m,\tilde\psi_k}(\theta)$ of all orders $m$, (iii) equal $\psi(\theta)$ on a large subset $\Theta_{sub,k}$ of $\Theta$ and (iv) are such that the initial estimator $\hat\theta$ is an element of $\Theta_{sub,k}$, so that the plug-ins $\psi(\hat\theta)$ and $\tilde\psi_k(\hat\theta)$ are equal. To prepare we introduce a simplified notation. For functions $h(o,\cdot)$ or $r(\cdot)$ of $\theta$, we will often write $h(o,\hat\theta)$ and $r(\hat\theta)$ as $\hat h(o)$ and $\hat r$, and $E_{\hat\theta}[\cdot]$ as $\hat E[\cdot]$. Similarly, we often write $h(o,\theta)$ and $r(\theta)$ as $h(o)$ and $r$, and $E_\theta[\cdot]$ as $E[\cdot]$. Further we shall introduce slightly different definitions of truncation and estimation bias.

Define the estimator $\hat\psi_{m,k} \equiv \tilde\psi_k(\hat\theta) + \mathbb{IF}_{m,\tilde\psi_k}(\hat\theta)$; equivalently, $\hat\psi_{m,k} = \hat\psi + \widehat{\mathbb{IF}}_{m,\tilde\psi_k}$. Then the conditional bias $E[\hat\psi_{m,k}|\hat\theta] - \psi$ of $\hat\psi_{m,k}$ is $TB_k + EB_m$, where the truncation bias $TB_k = \tilde\psi_k - \psi$ is zero for $\theta \in \Theta_{sub,k}$ and does not depend on $m$ and the estimation bias $EB_{m,k} = E[\hat\psi_{m,k}|\hat\theta] - \tilde\psi_k$ is $O_p\left(\|\hat\theta - \theta\|^{m+1}\right)$ by Theorem 2.2. Since, as we show later, the order of $EB_{m,k}$ does not depend on $k$, we will abbreviate $EB_{m,k}$ as $EB_m$, suppressing the dependence on $k$. Under minimal conditions, the conditional variance of $\hat\psi_{m,k}$ is of the order of $\mathrm{var}[\widehat{\mathbb{IF}}_{m,m,\tilde\psi_k}]$ whenever $k = k(n) > n$. The rate of convergence of $\hat\psi_{m,k}$ to $\psi$ can depend on the choice of $\tilde\psi_k$. Nevertheless, many different choices $\tilde\psi_k$ result in estimators $\hat\psi_{m,k}$ that achieve what we conjecture to be the optimal rate for estimators of the form $\hat\psi_{m,k}$. We choose, among all such $\tilde\psi_k$, the class that minimizes the computational complexity of $\hat\psi_{m,k}$. Specifically for all $\tilde\psi_k$ in our chosen class and all $j$, $\mathbb{IF}_{j,j,\tilde\psi_k}$ consists of a single term rather than a sum of many terms. We conjecture this appealing property does not hold for any $\tilde\psi_k$ outside our class. We now describe this choice. The parameter $\tilde\psi_k$ is defined in terms of $k(n)$-dimensional 'working' linear parametric submodels for $p(\cdot)$ and $b(\cdot)$ depending on unknown parameters $\alpha_k$ and $\eta_k$ through the basepoints $\hat p(\cdot)$ and $\hat b(\cdot)$, where $\hat p(\cdot)$ and $\hat b(\cdot)$ are initial estimators from the training sample. Specifically let $\dot p(X)$ and $\dot b(X)$ be arbitrary known functions chosen by the analyst satisfying Eqs. (3.14)-(3.16) below.



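(3.14) $\dot p(X)\,\dot b(X)\,E[H_1|X] > 0$ w.p.1,

(3.15) $\left\|\dot p(X)/\dot b(X)\right\|_\infty < c^*$, $\left\|\dot b(X)/\dot p(X)\right\|_\infty < c^*$,

(3.16) $\dot p(x)/\dot b(x)$ has at least $\lceil\max\{\beta_b, \beta_p\}\rceil$ derivatives.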



Particular choices of $\dot p(X)$ and $\dot b(X)$ can make the form of $\mathbb{IF}_{m,\tilde\psi_k}$ more aesthetic. The choice has no bearing on the rate of convergence of the estimator $\hat\psi_{m,k}$ to $\psi(\theta)$. Often there are fairly natural choices for $\dot p(\cdot)$ and $\dot b(\cdot)$. See Remark 3.9 below for examples. Let $\alpha_k, \eta_k$ be $k$-vectors of unknown parameters and consider the 'working' linear models

(3.17) $p^*(X,\alpha_k) \equiv \hat p(X) + \dot p(X)\,\alpha_k^T\bar z_k(X) = \hat P + \dot P\,\alpha_k^T\bar Z_k$,

(3.18) $b^*(X,\eta_k) \equiv \hat b(X) + \dot b(X)\,\eta_k^T\bar z_k(X) = \hat B + \dot B\,\eta_k^T\bar Z_k$.

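We define the parameters $\tilde\eta_k(\theta)$ and $\tilde\alpha_k(\theta)$ respectively to be the solution to

(3.19) $0 = E_\theta\left[\partial H\left(b^*(X,\eta_k), p^*(X,\alpha_k)\right)/\partial\alpha_k\right] = E_\theta\left[\left\{H_1 b^*(X,\eta_k) + H_3\right\}\dot P\bar Z_k\right]$,

(3.20) $0 = E_\theta\left[\partial H\left(b^*(X,\eta_k), p^*(X,\alpha_k)\right)/\partial\eta_k\right] = E_\theta\left[\left\{H_1 p^*(X,\alpha_k) + H_2\right\}\dot B\bar Z_k\right]$.

The solutions to (3.19) and (3.20) exist in closed form as

(3.21) $\tilde\eta_k(\theta) = -\left\{E_\theta\left[\dot B\dot P H_1\bar Z_k\bar Z_k^T\right]\right\}^{-1}E_\theta\left[\bar Z_k\dot P\left(H_1\hat B + H_3\right)\right]$,

(3.22) $\tilde\alpha_k(\theta) = -\left\{E_\theta\left[\dot P\dot B H_1\bar Z_k\bar Z_k^T\right]\right\}^{-1}E_\theta\left[\bar Z_k\dot B\left(H_1\hat P + H_2\right)\right]$.

Next define $\tilde b(\theta) = \tilde b(\cdot,\theta) \equiv b^*(\cdot,\tilde\eta_k(\theta))$ and $\tilde p(\theta) = \tilde p(\cdot,\theta) \equiv p^*(\cdot,\tilde\alpha_k(\theta))$ and

$$\tilde\psi_k(\theta) \equiv E_\theta\left[H\left(\tilde b(\theta), \tilde p(\theta)\right)\right].$$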
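The closed form for the working-model coefficient can be checked on a toy example. This is a minimal sketch under stated assumptions, not the paper's implementation: X is taken to be discrete with a hypothetical support and probabilities so that expectations are exact sums, the basis has k = 2 elements, the functions `h1` and `h3` stand in for the conditional means of $H_1$ and $H_3$ given X, and the analyst-chosen functions are set to 1. It reads the closed form as $\tilde\eta_k = -\{E[\dot B\dot P H_1\bar Z_k\bar Z_k^T]\}^{-1}E[\bar Z_k\dot P(H_1\hat B + H_3)]$ and verifies that the defining estimating equation $E[\{H_1 b^* + H_3\}\dot P\bar Z_k] = 0$ holds at the solution.

```python
support = [0.0, 1.0, 2.0, 3.0]      # hypothetical support of X
probs = [0.1, 0.2, 0.3, 0.4]        # hypothetical P(X = x)

def expect(f):
    """Exact expectation of f(X) under the discrete law above."""
    return sum(p * f(x) for x, p in zip(support, probs))

def z(x):
    """Two-dimensional basis vector (k = 2), an assumption of this sketch."""
    return [1.0, x]

def h1(x): return 1.0 + 0.5 * x     # stands in for E[H_1 | X = x]
def h3(x): return -x * x            # stands in for E[H_3 | X = x]
def b_hat(x): return 0.5 * x        # initial estimator of b
b_dot = p_dot = 1.0                 # analyst-chosen known functions, set to 1

def solve2(a, c):
    """Solve the 2x2 linear system a y = c by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(c[0] * a[1][1] - a[0][1] * c[1]) / det,
            (a[0][0] * c[1] - a[1][0] * c[0]) / det]

gram = [[expect(lambda x, i=i, j=j: b_dot * p_dot * h1(x) * z(x)[i] * z(x)[j])
         for j in range(2)] for i in range(2)]
rhs = [expect(lambda x, i=i: z(x)[i] * p_dot * (h1(x) * b_hat(x) + h3(x)))
       for i in range(2)]
eta = solve2(gram, [-r for r in rhs])   # closed-form coefficient

def b_star(x):
    """Working model evaluated at the closed-form coefficient."""
    return b_hat(x) + b_dot * sum(e * zi for e, zi in zip(eta, z(x)))

# Estimating equation evaluated at the solution; both entries should be ~0.
residual = [expect(lambda x, i=i: (h1(x) * b_star(x) + h3(x)) * p_dot * z(x)[i])
            for i in range(2)]
```

Because the working model is linear in the coefficient, the estimating equation is linear and no iteration is needed, which is the computational appeal of this choice.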



Note the models $p^*(\cdot,\alpha_k)$ and $b^*(\cdot,\eta_k)$ are used only to define the truncated parameter $\tilde\psi_k(\theta)$. They are not assumed to be correctly specified. In particular, the training sample estimates $\hat p, \hat b$ need not be based on the models $p^*(\cdot,\alpha_k)$, $b^*(\cdot,\eta_k)$. We now compare our truncated parameter $\tilde\psi_k(\theta)$ with $\psi(\theta)$ and calculate the truncation bias. It is important to keep in mind that $b, p$ are components of the unknown $\theta$ while $\hat p, \hat b, \dot p, \dot b$ are regarded as known functions.
Theorem 3.7. If our model satisfies (Ai)-(Aiii) and

$$\theta \in \Theta_{sub,k} \equiv \left\{\theta;\ p(\cdot) = p^*(\cdot,\alpha_k) \text{ for some } \alpha_k \text{ or } b(\cdot) = b^*(\cdot,\eta_k) \text{ for some } \eta_k\right\} \cap \Theta$$

then $\tilde\psi_k(\theta) = \psi(\theta)$.

Further $TB_k(\theta) \equiv \tilde\psi_k(\theta) - \psi(\theta) = E_\theta\left[\{\tilde b(\theta) - b\}\{\tilde p(\theta) - p\}H_1\right]$. □

Proof. Immediate from Theorem 3.2 and Lemma 3.3.

We know from the above Theorem that $TB_k(\theta) = 0$ for $\theta \in \Theta_{sub,k}$. However to control the truncation bias in forming confidence intervals for $\psi(\theta)$ we will need to know how fast $\sup_{\theta\in\Theta}\{TB_k(\theta)\}$ decreases as $k$ increases. The following theorem is a key step towards determining an upper bound.

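Theorem 3.8. Suppose $\dot b(X)$ and $\dot p(X)$ are chosen so that $\dot B\dot P E[H_1|X] > 0$ w.p.1. Let

$$Q \equiv q(X) = \left(\dot B\dot P E[H_1|X]\right)^{1/2}$$

and $\Pi[h(X)|Q\bar Z_k]$ and $\Pi^{\perp}[h(X)|Q\bar Z_k]$ be, respectively, the projection in $L_2(F_X(x))$ of $h(X)$ on the $k$ dimensional linear subspace $\mathrm{lin}\{Q\bar Z_k\}$ spanned by the components of the vector $Q\bar Z_k = q(X)\bar z_k(X)$ and the projection on the orthocomplement of this subspace. Then if (Ai)-(Aiii) are satisfied,

$$TB_k = E\left[\Pi^{\perp}\left[\left(\frac{P - \hat P}{\dot P}\right)Q\,\Big|\,Q\bar Z_k\right]\Pi^{\perp}\left[\left(\frac{B - \hat B}{\dot B}\right)Q\,\Big|\,Q\bar Z_k\right]\right].$$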
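The two expressions for the truncation bias can be compared on a toy example. The sketch below is illustrative only and rests on several labeled assumptions: X is discrete with a hypothetical support, the basis has a single element (k = 1), the analyst-chosen functions equal 1 so that $Q^2 = E[H_1|X]$, and the conditional identities $E[H_3|X] = -E[H_1|X]\,b(X)$ and $E[H_2|X] = -E[H_1|X]\,p(X)$ (the conditional form of the Lemma 3.3 identities invoked in the proofs) are used to rewrite the closed-form working-model coefficients. It reads the theorem's display as $TB_k = E[\Pi^\perp[Q(P-\hat P)/\dot P\,|\,Q\bar Z_k]\,\Pi^\perp[Q(B-\hat B)/\dot B\,|\,Q\bar Z_k]]$ and checks that this agrees with $E[\{\tilde b - b\}\{\tilde p - p\}H_1]$.

```python
import math

support = [0.0, 1.0, 2.0, 3.0]      # hypothetical support of X
probs = [0.1, 0.2, 0.3, 0.4]        # hypothetical P(X = x)

def expect(f):
    """Exact expectation of f(X) under the discrete law above."""
    return sum(p * f(x) for x, p in zip(support, probs))

def h1(x): return 1.0 + 0.5 * x                 # stands in for E[H_1 | X = x] > 0
def q(x): return math.sqrt(h1(x))               # Q when b_dot = p_dot = 1
def z(x): return x - 1.0                        # single basis function (k = 1)
def b(x): return math.sin(x)                    # hypothetical true b
def b_hat(x): return 0.8 * math.sin(x) + 0.1    # hypothetical initial estimator
def p(x): return 1.0 / (1.0 + x)                # hypothetical true p
def p_hat(x): return 0.5                        # hypothetical initial estimator

def working_fit(truth, initial):
    """Working-model fit: initial + coef * z, coef from the closed form with k = 1."""
    coef = (expect(lambda x: h1(x) * z(x) * (truth(x) - initial(x)))
            / expect(lambda x: h1(x) * z(x) ** 2))
    return lambda x: initial(x) + coef * z(x)

b_tilde = working_fit(b, b_hat)
p_tilde = working_fit(p, p_hat)
tb_direct = expect(lambda x: h1(x) * (b_tilde(x) - b(x)) * (p_tilde(x) - p(x)))

def perp(u):
    """Residual of the L2(F_X) projection of u(X) on the span of Q(X) z(X)."""
    def w(x):
        return q(x) * z(x)
    coef = expect(lambda x: u(x) * w(x)) / expect(lambda x: w(x) ** 2)
    return lambda x: u(x) - coef * w(x)

perp_p = perp(lambda x: q(x) * (p(x) - p_hat(x)))
perp_b = perp(lambda x: q(x) * (b(x) - b_hat(x)))
tb_projection = expect(lambda x: perp_p(x) * perp_b(x))
```

The agreement is exact up to rounding because, under these assumptions, $\tilde b - b = -(\dot B/Q)\,\Pi^\perp[Q(B-\hat B)/\dot B\,|\,Q\bar Z_k]$ and similarly for $\tilde p - p$.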



Remark 3.9. To simplify various formulae it is often convenient and aesthetically pleasing to have $\hat Q = 1$. We can choose $\dot B$ and $\dot P$ to guarantee $\hat Q = 1$ w.p.1. For the functional $\psi(\theta) = E_\theta[b(X)p(X)]$ of Example 1a, $H_1 = -1$ w.p.1. Thus choosing $\dot B$ and $\dot P$ equal to 1 and $-1$, respectively, w.p.1 makes $\hat Q = 1$ w.p.1. In the missing data Example 2a, the function $H_1 = -A$ so $\hat E[H_1|X] = -1/\hat P$ and thus the choice $\dot B = -1$, $\dot P = \hat P$ makes $\hat Q = 1$ w.p.1. Note since inference on $\psi(\theta)$ is conditional on the training sample data, we view the initial estimator $\hat p(\cdot)$ of $p(\cdot)$ from the training sample as known and thus an analyst is free to choose $\dot P$ to be $\hat P$.



Examples continued. In Example la, recall ip ^ E [BP]. Choose B 
w.p.l so Q = Q = 1, and take 13 e lin {Zk}. Then 



-P = 1 



B = B + n 



U[BlZk] 



B-B] 

p = n [PiZk] , 

TBk = E{ [n^ [BlZk] [P\Zk\]] , 
i^k^^-TBk^E{ll [BlZk] n [PlZk] } . 

Thus -tpk appears to be the natural choice for a truncated parameter. 



E[B],B = -1,P = P= 1/n, Q = 1,Q = [P/P] 
I S > ,n = TT (X) ,n = TT (X), we obtain 



In Example 2a with 
1/2 



1/2 



TBk = E 



{i} 



1/2 



B-B 



{f} 



1/2 _ ■ 
Zk 

1/2 _ ■ 
Zk 
Thus the truncated parameter $\tilde\psi_k = \psi - TB_k$ does not seem to be a particularly natural or obvious choice. The complexity of $\tilde\psi_k$ is not simply due to the fact that we chose $\dot P = \hat P$ rather than $\dot P = 1$ as we now demonstrate.

In Example 2a with S = -1, P = 1 , Q = ^f^/^ Q = tt^/^ 



Nonetheless we will see that, for either choice of $(\dot B, \dot P)$, the parameter $\tilde\psi_k$ will result in estimators with good properties.

Remark 3.10. Henceforth, given $(\beta_p, \beta_b, \beta_g)$, $\{\varphi_l(X), l = 1, 2, \ldots\}$ will always denote a complete orthonormal basis with respect to Lebesgue measure in $R^d$ or in the unit cube in $R^d$ that provides optimal rate approximation for Hölder balls $H(\beta^*, C)$, $\beta^* \le \lceil\max(\beta_b, \beta_p, \beta_g)\rceil$, i.e.

(3.23) $\sup_{h \in H(\beta^*,C)}\ \inf_{\varsigma_1,\ldots,\varsigma_k}\int\left(h(x) - \sum_{l=1}^{k}\varsigma_l\varphi_l(x)\right)^2 dx = O\left(k^{-2\beta^*/d}\right)$.

The basis consisting of d-fold tensor products of univariate orthonormal polynomials satisfies (3.23) for all $\beta^*$. The basis consisting of d-fold tensor products of a univariate Daubechies compact wavelet basis with mother wavelet $\varphi_w(u)$ satisfying

$$\int u^m\varphi_w(u)\,du = 0,\quad m = 0, 1, \ldots, M$$

also satisfies (3.23) for $\beta^* < M + 1$.



Theorem 3.11. Suppose that (Ai)-(Aiv) are satisfied, thatb{X) andp{X) satisfy 
{3.14) - {3.16) and that we take 



(3.24) 

where 

Then 



-1/2 



zk {X) = ^kjix) {e [H,\X] b {x)p{x)y 

^,j{X) EE [e (X)^, (X)^] {X) . 

$$\sup_{\theta\in\Theta}\left\{TB_k^2(\theta)\right\} = O_p\left(k^{-2(\beta_b+\beta_p)/d}\right).$$

Remark 3.12. Note if we have chosen b{-) and p{-) so that Q = 1 wpl then 

1/2 

exponent exceeding max{/3f,, /3p} , 



Zk {X) = if ■7{X) simplifies. The preceeding theorem does not hold with ip^ {X) j 

-,1/2 _ 

|/(X)> replacing ip^^{X) unless fx is known to lie in a Holder ball with 



3.2.2. Derivation of the higher order influence functions of the truncated 
parameter 

We begin by proving that the first order influence functions of $\tilde\psi_k$ and $\psi$ are identical except with $\tilde b(\theta), \tilde p(\theta), \tilde\psi_k(\theta)$ replacing $b, p, \psi(\theta)$.
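Theorem 3.13. $\mathbb{IF}_{1,\tilde\psi_k}(\theta) = \mathbb{V}\left[IF_{1,\tilde\psi_k}(\theta)\right]$ with

$$IF_{1,\tilde\psi_k}(\theta) = H\left(\tilde b(\theta), \tilde p(\theta)\right) - \tilde\psi_k(\theta).$$

Proof. Since $\tilde\psi_k(\theta) = E_\theta\left[H\left(\tilde b(\theta), \tilde p(\theta)\right)\right]$,

$$IF_{1,\tilde\psi_k}(\theta) = H\left(\tilde b(\theta), \tilde p(\theta)\right) - \tilde\psi_k(\theta) + E_\theta\left[\partial H\left(b^*(X,\tilde\eta_k(\theta)), p^*(X,\tilde\alpha_k(\theta))\right)/\partial\eta_k^T\right]IF_{1,\tilde\eta_k(\cdot)}(\theta) + E_\theta\left[\partial H\left(b^*(X,\tilde\eta_k(\theta)), p^*(X,\tilde\alpha_k(\theta))\right)/\partial\alpha_k^T\right]IF_{1,\tilde\alpha_k(\cdot)}(\theta).$$

But, by definition of $\tilde\eta_k(\theta)$ and $\tilde\alpha_k(\theta)$, both expectations are zero. □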



Note that $\tilde\eta_k(\theta)$ and $\tilde\alpha_k(\theta)$ are not maximizers of the expected log-likelihood for $\alpha_k$ and $\eta_k$. This choice was deliberate. Had we defined $\tilde\eta_k(\theta)$ and $\tilde\alpha_k(\theta)$ as the maximizers of the expected log-likelihood, then $\mathbb{IF}_{1,\tilde\psi_k}(\theta)$ would have had additional terms since the expectations in the preceding proof would not be zero. The existence of these extra terms would translate to many extra terms in $\mathbb{IF}_{m,\tilde\psi_k}(\theta)$ for large $m$ leading to computational difficulties. Similarly had we chosen models $p^*(X,\alpha_k) = \Phi\left(\hat P + \dot P\alpha_k^T\bar Z_k\right)$ and $b^*(X,\eta_k) = \Phi\left(\hat B + \dot B\eta_k^T\bar Z_k\right)$ with $\Phi(\cdot)$ a non-linear inverse-link function, $\mathbb{IF}_{m,\tilde\psi_k}(\theta)$ would also have had many extra terms without an improvement in the rate of convergence.

The following is proved in the Appendix. 



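Theorem 3.14. $\mathbb{IF}_{m,\tilde\psi_k} = \mathbb{IF}_{1,\tilde\psi_k} + \sum_{j=2}^{m}\mathbb{IF}_{j,j,\tilde\psi_k}$ where $\mathbb{IF}_{j,j,\tilde\psi_k}$ is a $j$th order degenerate U-statistic given by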



IF. 



IF^ 



IF. 



(-1)-' [( 



HiP + H2]BZ 



Zk[HiB + H^]P 

■T 



PBH^ZkZ, 



E 



'k 



PBH.ZkZl 



y 



[PBHiZkZ^ 



E 



X \E 



PBH.ZkZ, 



PBHiZkZ^ 

Y' [Zfc (h^b + h^"^ P 



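3.2.3. The Estimator $\hat\psi_{m,k} = \hat\psi + \widehat{\mathbb{IF}}_{m,\tilde\psi_k}$ and its Estimation Bias

We can now calculate the estimator $\hat\psi_{m,k} = \hat\psi + \widehat{\mathbb{IF}}_{m,\tilde\psi_k}$ by substitution of $\hat\theta$ for $\theta$ in $\mathbb{IF}_{m,\tilde\psi_k} = \mathbb{IF}_{m,\tilde\psi_k}(\theta)$ to obtain the following.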
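Theorem 3.15. Suppose (3.24) holds and define $Z(X) = E[H_1|X]$. Then $\hat\psi_{m,k} = \hat\psi + \widehat{\mathbb{IF}}_{1,\tilde\psi_k} + \sum_{j=2}^{m}\widehat{\mathbb{IF}}_{j,j,\tilde\psi_k}$ where

$$\hat\psi + \widehat{IF}_{1,\tilde\psi_k} = \hat B\hat P H_1 + \hat B H_2 + \hat P H_3 + H_4$$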



IF 



\ HiP + H2 ] BZI 



Zk(HiB + H3]P 



„,l/2 v'^AX) 
H^P + H,][S,] 

-1/2 Tp^ ^X) 



IF . 



= (-1) 



i-1 



PBHiZkZl 



kxk 




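Proof. By Lemma 3.3, $\hat E\left[\left(H_1\hat B + H_3\right)\dot P\bar Z_k\right] = \hat E\left[\left(H_1\hat P + H_2\right)\dot B\bar Z_k\right] = 0$. Thus by Eqs. (3.21) and (3.22) $\tilde\eta_k(\hat\theta) = \tilde\alpha_k(\hat\theta) = 0$ so $\tilde B(\hat\theta) = \hat B$ and $\tilde P(\hat\theta) = \hat P$.

Further, by Eq. (3.24), $\hat E\left[\dot P\dot B H_1\bar Z_k\bar Z_k^T\right] = \hat E\left[\dot P\dot B\,\hat E[H_1|X]\,\bar Z_k\bar Z_k^T\right] = \hat E\left[\hat Q^2\bar Z_k\bar Z_k^T\right] = I_{k\times k}$. □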



Remark 3.16. The reader can easily check that when B = P = 1 and Hi > wp 



1, IF„ 



is precisely the same as /i^2 ii i-, 



9 \ of equation (3.12) in Section 



s=3 



3.2.1. By expanding the product <^ j ) —Ikxk f, the equality of 

depends on P and B only through their ratio. 

= -P = 1 w.p.l, Q ^ l,Hi 

I^^^.X) SoElZkZl 1 = TlnPrn 



IF ~ - and /i^'-'^'' _ [0] can be demonstrated for all m. Note that IF . . 



Example la (Continued), tp = E[BP],B 
-l,Zk=^,M so E ^'^ 



kxk- 
and thus
IF ~ - 



A-P]zl 



Z, (Y -B 



IF - - - (-1)^ 



A-P\ zl 



s=3 J 



Example 2a (Continued). Hi = -A, B = -l,P = P = 1/9, Q = 1, ij,' = E[B], 



Q = {P/Py^^ = {f } and Zfc = ^k^jiX); so E 



1/2 



hxk- Then 



- 1 , so 



(-1) 



TT 

J-1 



-1 u 



Zki{Y-B 



TT ' 



.s=3 

















{y-B)] 






IT 




= 1 ,Q 






and Zk 



E 



Q^ZkZ 



ki^ 



so 



Ikxk, P 



A Y - B 



IF„ 



/F 



jj,ipk,ij 



= (-1) 



TT 

J-1 



1 z 



ZkA Y-B 



i-uz 



Ikxk 

s=3 



V [z,a(y-b) 



Our next theorem, proved in the Appendix of our technical report, derives the estimation bias $EB_m = E\left[\hat\psi_{m,k}|\hat\theta\right] - \tilde\psi_k$.

Theorem 3.17. Suppose (3.14)-(3.16) and (Ai)-(Aiv) hold then



(3.25) FS,„ = (-1)' 



E 



X {E 



2 / B-B 



{ 



E 



9^ -T^T 



Q'^ZkZ^ 



Q ZkZj. 
} ^ E^kQ 



Ikxk 
2 I P-P 



\EB„ 



(3.26) 



< 





















■DO 




oc 



ii'55iir'(i+«p(i))x 



1/2 r 2 '^-'^/^ 

{/(p(X)-p(X))^dx} /(fe(X)-6(X)) 
for $m > 1$, where $\delta g = \dfrac{g(X) - \hat g(X)}{\hat g(X)}$.



Remark 3.18. At the cost of a longer proof we could have used Hölder's inequality repeatedly to control $\delta g$ in the $L_p$ norm $\|\delta g\|_{m+1}$ with $p = m + 1$ to show that

$$|EB_m| = O_p\left(\|\delta g\|_{m+1}^{m-1}\,\left\|\hat b(\cdot) - b(\cdot)\right\|_{m+1}\left\|\hat p(\cdot) - p(\cdot)\right\|_{m+1}\right).$$

Thus, $|EB_m|$ is $O_p\left(\|\hat\theta - \theta\|^{m+1}\right)$, consistent with the form of the bias given in our fundamental Theorem 2.2.



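Remark 3.19 (An alternate derivation of $\hat\psi_{m,k}$). The above derivation of $\hat\psi_{m,k}$ required that one have facility in calculating higher order influence functions $\mathbb{IF}_{m,\tilde\psi_k}$, as done in the proof of Theorem 3.14 in the Appendix. However, there exists an alternate derivation of $\hat\psi_{m,k}$ that does not require one learn how to calculate influence functions. Specifically, we know from Theorems 2.2 and 2.3 that in a (locally) nonparametric model $\widehat{\mathbb{IF}}_{j,j,\tilde\psi_k}$, $j \ge 2$, is the unique $j$th order U-statistic that is degenerate under $\hat\theta$ and satisfies

(3.28) $EB_{j-1} + E\left[\widehat{\mathbb{IF}}_{j,j,\tilde\psi_k}\,\big|\,\hat\theta\right] = EB_j = O_p\left(\|\hat\theta - \theta\|^{j+1}\right)$

with $EB_1 = E\left[\hat\psi_1|\hat\theta\right] - \tilde\psi_k$. In fact, we first derived $\hat\psi_{m,k}$ by beginning with $\hat\psi_1 = \hat\psi + \widehat{\mathbb{IF}}_{1,\tilde\psi_k}$, calculating $EB_1 = E\left[\hat\psi_1|\hat\theta\right] - \tilde\psi_k$, and then, recursively for $j = 2, \ldots$, finding $\widehat{\mathbb{IF}}_{j,j,\tilde\psi_k}$ satisfying the above equation. In fact if one did not even know how to derive $\mathbb{IF}_{1,\tilde\psi_k}$, one could begin the recursion by obtaining $\widehat{\mathbb{IF}}_{1,\tilde\psi_k}$ as the unique first order U-statistic with mean zero under $\hat\theta$ satisfying $\hat\psi - \tilde\psi_k + E\left[\widehat{\mathbb{IF}}_{1,\tilde\psi_k}\,\big|\,\hat\theta\right] = O_p\left(\|\hat\theta - \theta\|^{2}\right)$.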



3.2.4. The variance of $\hat\psi_{m,k} = \hat\psi + \widehat{\mathbb{IF}}_{m,\tilde\psi_k}$ using compact wavelets

In this section, we derive the order of the variance of $\hat\psi_{m,k}$ when the orthonormal system $\{\varphi_j(X)\}$ used to construct our U-statistics is a compact wavelet basis. First consider the case where $X$ is univariate; without loss of generality, assume that $X \sim$ Uniform[0, 1]. Because we are primarily interested in convergence rates, the fact that $X$ may not follow the uniform distribution will not affect the rate results given below, but can influence the size of the constants. We use $\phi_j(X)$ in place of $\varphi_j(X)$ to indicate univariate basis functions.

Let fc*, fc be integer powers of two with k > k*. Denote by (X) = (X) the 

k— dimensional basis vector whose first k* components (j>i (X) arc the fc*— vector 
of level log2 k* scaled and translated versions of a compactly supported 'father' 

wavelet (Mallat [10]) and whose last k — k* components <t>k*+i i^) are the as- 
sociated compact mother wavelets between levels log2 k* and log2 k. In partic- 
ular, one may use periodic wavelets, folded wavelets or Daubechies' boundary 
wavelets with enough vanishing moments to obtain the optimal approximation 
rate of O {k-'^^/'^) for /? = max (/3g, /3f,). The multiresolution analysis (MRA) 
property of wavelets allows us to decompose the vector space spanned by the 
$\log_2(k)$-level father wavelets $V_{\log_2(k)}$ into the direct sum of the subspace spanned
by log2 (fc*)-level father wavelets V\og^{k*) = {X) : a ^ | and the span 

of mother wavelets for each level between logj [k*) and log2 (fc) — 1 which we re- 
spectively write as 

Wiog,(fe.) = (X):aei?'^*}, 
(fco)+i = {a^02fc-+i {X):a(^ R^^ | , 



log. 



Then for any integer s with log2 (fc*) + 1 < s, we have 

As $s \to \infty$, the resulting basis system is dense in $L_2(X)$ (Mallat [10]). Since, in fact, $X$ is $d$-dimensional we require a generalization that allows for multivariate tensor wavelet basis functions. In fact, suppose $X = (X^1, \ldots, X^d)$ is now multivariate, and we again assume $X \sim$ Uniform on $[0, 1]^d$. Given $d$ univariate vector spaces

Vl,log2(fc), V2,log2(fc), ■ • ■ , VdJog2(fc) 

respectively spanned by vectors (pi {X^^ , (p^ {X"^^ , . . . [X'^) , so that for 1 < r < 
d, 

^r,log2(k*) C VrJog2(/c*) + l C . . . C VrJogaC*:)-! C V^JogaCfc) 

and 

/ log2(fc)-l 

Kjog2(fc) = K,iog2(fc*) ffi I Wr,t, 

yt>=log2(fc*) 

One may define d dimensional tensor vector spaces 

3^d,log2(A:*), 3^d,log2(fc*) + l, ■ • ■ , J^d.log^ (fe) 

such that 

yd.log^ik') C 3^d,log2(fc*) + l C . . . C 3^d,log2(fc), 

where for s > 0, 

3^d,log2(fco)+S= Kaog2(fco)+S- 

l<r<d 

As s — > 00, the resulting tensor basis system is dense in L2 (X) (Mallat [10]). 
Next, suppose that we have a set of multivariate basis functions 



{^J^ (X),j =0,l,...,2m} 



such that for each fcj, ip^^ {X) spans ^r,iog2(fc r)' where fcj^r = kj. Define 

II • II2 as the L2 norm with respect to the Lebesgue measure. The following theorem 
is key to our derivation of the order of the variance of V'm.fe- 
Theorem 3.20. For m > 0,

771 

m+l 

X n k,. 

The following theorem is an immediate consequence of Theorem 3.20 obtained by taking $k_j = k^* = k$ (which implies we use the father wavelets at level $\log_2(k)$ but no mother wavelets).



Theorem 3.21. For all eQ, 

varg 



IF,~ {9) 



1 

n 



varg 



IF ~ (61) 



1 \ fk 
— max < 1 , — 
n \ \ n 



varg 



IF ~ (9) 



var-T 



IF 



1 fk 

— max < 1 , — 
n \ \n 



We now use Theorem 3.21 to derive the order of the conditional variance of $\hat\psi_{m,k}$ given $\hat\theta$.



Theorem 3.22. If supoeo 



f [o-9] ~ f {o-9) -.Qas\\9 



0, then for a fixed 



varg 



^ ~\9 



= varg 


IF 




77fl 


= var^ 


IF 




u 


m 


( 1 




\- 


X — max < 









m — l 



The proof is in our technical report. 

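For a given $m$, the estimator $\hat\psi_{m,k_{opt}(m)}$ that minimizes the maximum asymptotic MSE over the model $M(\Theta)$ defined by (Ai)-(Aiv) among the candidates $\hat\psi_{m,k}$ uses the value $k_{opt}(m) = k_{opt}(m,n)$ of $k$ that equates the order $\frac{1}{n}\max\left\{1, \left(\frac{k}{n}\right)^{m-1}\right\}$ of $\mathrm{var}\left[\hat\psi_{m,k}\right]$ to the order

$$\max\left\{(TB_k)^2, \left(EB_m(\theta)\right)^2\right\} = \max\left(k^{-\frac{2(\beta_b+\beta_p)}{d}},\ n^{-\left(\frac{2(m-1)\beta_g}{2\beta_g+d} + \frac{2\beta_b}{2\beta_b+d} + \frac{2\beta_p}{2\beta_p+d}\right)}\right)$$

of the maximal squared bias. The estimator $\hat\psi_{m_{opt},k_{opt}} \equiv \hat\psi_{m_{opt},k_{opt}(m_{opt})}$ that minimizes the maximum asymptotic MSE over the model $M(\Theta)$ among all candidates $\hat\psi_{m,k}$ is the estimator $\hat\psi_{m,k_{opt}(m,n)}$ which minimizes $\frac{1}{n}\max\left\{1, \left(\frac{k_{opt}(m,n)}{n}\right)^{m-1}\right\}$.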
3.2.5. Distribution theory and confidence interval construction 



We derive a consistent estimator of the variance and give the asymptotic distribution of $\hat\psi_{m,k}$ for any model and functional satisfying (Ai)-(Aiv). Let $z_\alpha$ be the upper $\alpha$-quantile of a standard normal distribution, i.e. a $N(0, 1)$.



Theorem 3.23. (a) Let ~ = n-^Y 

1,1/"!= 



/2 ~ 



V 



IF 



for j > 2, and 



/2 ~ 



/2 L \^ m2 

l,il>k ^ 



i=2 



where IF ■ ■ 7 , ^ is the symmetric kernel of IF . ~ , , . We have, 



E 




= var 


IF^ 7 1 


E 


. 3j,'>Pk . 


= var 


IF.. 7 


E 


m,ipk. 


= var 


IF 7 



where var[-] = var-^[-]. 

(b) Conditional on the training sample, 

^ \ -1/2 



■m,kopt{m,n) 



- E 



n)\0\} 



converges uniformly for 6 Cz Q to a normal distribution with finite variance as 
n — !• oo . The asymptotic variance is uniformly consistently estimated by 



1 fk 
— max < 1 , — 
n \ \n 



nl,^|^k,pt(^,,^) 

Thus 

is converging in distribution to a standard normal distribution. 

7 7 

opt' ^ 



(c) Define the interval Crn,k = V'm.fci-ZaW^ ~ . Suppose kopt (fn, n) = ri^opti"^,"-) , 



Then for k* ^ nP , p* > p^^,{m,n), 



supeee 



Eg 



l^2,k'\0 



varg 'ip2.k 



Opil) 
and $\left\{\hat\psi_{m,k^*} - \psi(\theta)\right\}/\widehat{\mathbb{W}}_{m,\tilde\psi_{k^*}}$ converges uniformly in $\theta \in \Theta$ to a $N(0,1)$. Moreover, $C_{m,k^*}$ is a conservative uniform asymptotic $(1 - \alpha)$ confidence interval for $\psi(\theta)$.

(d) Suppose we could derive a constant Cuas and a constant N* such that 



sup 



m.koptijn-n) 



sup I {TB_ 



kapt{m.n) 



{0) + EBm {e)}\ 



< Cbtas < - max < 1, 



1/2 



for n > N* . Then 



Ba 



m,kopt{m.n) 



m,kaptim,n) 



± < 



Cfcjas < - max < 1, 



1/2' 



is a conservative uniform asymptotic (1 — a) confidence interval for (6). 

Part (a) of the theorem is an easy calculation. The asymptotic normality of $\hat\psi_{m,k_{opt}(m,n)}$ is based on new results on the asymptotic distribution of higher order U-statistics with kernels depending on $n$ to be published elsewhere (Robins et al. [18]).

Part (c) of the theorem implies we obtain a conservative uniform asymptotic $(1 - \alpha)$ confidence interval for any value of $\rho^*$ exceeding $\rho_{opt}(m,n)$. However, for the actual fixed sample size of our study, say $n = 5000$, there is no guarantee the interval of part (c) based on a given difference $\rho^* - \rho_{opt}(m,n)$, say 0.3, will provide conservative finite sample coverage.

Because of this difficulty, a better approach, described in part (d), would be to determine a constant $C_{bias}$ that can be used to bound the maximal bias under the model at sample sizes exceeding $N^*$, with $N^*$ no greater than the actual fixed sample size $n$ of the study. Then the interval $BC_{m,k_{opt}(m,n)}$ will be an honest conservative finite sample $1 - \alpha$ confidence interval, provided that $\hat\psi_{m,k_{opt}(m,n)}$ has nearly converged to its normal limit at sample size $n$. Unfortunately, as yet we do not know how to determine the constants $C_{bias}$ and $N^*$ of part (d) as a function of our model and of our initial estimator $\hat\theta$. This is an important open problem.



3.2.6. Models of increasing dimension and multi-robustness 

A model of increasing dimension. The previous results can also be used for the analysis of models whose dimension increases with sample size. In fact, consider the model $M(\Theta_{n^\gamma})$, $\gamma$ known, that differs from model $M(\Theta)$ in that, rather than assuming $b(x)$ and $p(x)$ live in particular Hölder balls, we instead assume the working models of Eqs. (3.17) and (3.18) are precisely true for $k = n^\gamma$, so $\psi(\theta) = \tilde\psi_{n^\gamma}(\theta)$ and the dimensions of $b(x)$ and $p(x)$ increase as $n^\gamma$. Valid point and interval estimation for $\tilde\psi_{n^\gamma}(\theta)$ can still be based on the estimators $\hat\psi_{m,k}$ except now (i) there is truncation bias only when $k < n^\gamma$, (ii) the variance remains of the order of $\frac{1}{n}\max\left\{1, \left(\frac{k}{n}\right)^{m-1}\right\}$ and (iii) the estimation and truncation bias (when it exists) orders will be determined by any additional complexity reducing restrictions placed on the fraction of non-zero components or on the rate of decay of the components of the vectors $\tilde\eta_{n^\gamma}(\theta)$ and $\tilde\alpha_{n^\gamma}(\theta)$, and, for estimation bias, by $\beta_g$ as well. As a consequence, $m_{opt}$ and $k_{opt}$ under model $M(\Theta_{n^\gamma})$ will differ from their values under model $M(\Theta)$. Note we need not take $k = n^\gamma$ as we did in the heuristic discussion
following Remark 2.8. Indeed -ipm^ ^ in that discussion corresponds to the estimator 
in the class ipm,,k=ni with the fastest rate of convergence. In general, i/jm^ ^ will 
have convergence rate slower than V'mopt,fcopf Furthermore, the discussion in Section 
4.1.1 implies that, when n'' 2> n and the minimax rate for estimation of "0(0) is 



slower than n ^1"^ ^ even '0„ 



will typically fail to converge at the minimax rate 



when complexity reducing restrictions have been imposed on 77,^,, {&) and a,i'j (ff). 



Multi-robustness and a practical data analysis strategy. Conditional on $\hat\theta$, for $m \ge 2$, $EB_m$ is zero and thus estimator $\hat\psi_{m,k}$ is unbiased for $\tilde\psi_k$ if $\hat p(\cdot) = p(\cdot)$, $\hat b(\cdot) = b(\cdot)$, or $\hat g(\cdot) = g(\cdot)$. We refer to $\hat\psi_{m,k}$ as triply-robust for $\tilde\psi_k$, generalizing Robins and Rotnitzky [17] and van der Laan and Robins [24] who referred to $\hat\psi_1$ as doubly-robust because of its being unbiased for $\tilde\psi_k$ if either $\hat p(\cdot) = p(\cdot)$ or $\hat b(\cdot) = b(\cdot)$. In fact, for $m \ge 3$, we can construct a modified estimator $\hat\psi^{mod}_{m,k}$ that is $m + 1$-fold robust as follows. Let $\hat g_s(\cdot)$, $s = 3, \ldots, m$, denote $m - 2$ additional initial estimators of $g(\cdot)$ that differ from one another and from $\hat g(\cdot)$. Define



IF 



-mod 



jy^U^'j^^:,, where 



IF 



- mod 



(-ir' [( 



HiP + H2]BZk 



[PBHiZkZk 



Ikxk 



n{ 

s=3 



PBHiZkZ. 



(PBHiZkZlY -Es 



[% [pBH.ZkZl 



PBHiZkZ, 
Zk[HiB 



with $\hat E_s$ defined like $\hat E$, except with $\hat g_s(\cdot)$ replacing $\hat g(\cdot)$. In the Appendix of our



technical report, we prove that EB: 



mod 



E 



mod 



(-1)' 



E 



m 

n{^ 



X I I <j -B, 

s=3 



BPH, (i^) Zl] {e 

r 



V'fc is 



BPHiZkZ,, 



ikxk 



BPHiZkZl 



X !^E 

X |£; 



BPHiZkZ^ 
BPHiZkZl 



Es 



r 



E 



BPHiZkZf, 
BPH^ ' 



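which is zero if $\hat p(\cdot) = p(\cdot)$, $\hat b(\cdot) = b(\cdot)$, $\hat g(\cdot) = g(\cdot)$, or if any of the $m - 2$ $\hat g_s(\cdot)$ equals $g(\cdot)$. (We note that if $\hat p(\cdot) = p(\cdot)$ or $\hat b(\cdot) = b(\cdot)$, $\psi = \tilde\psi_k$ and thus $\hat\psi^{mod}_{m,k}$ and $\hat\psi_{m,k}$ are unbiased for $\psi$.)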
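The double robustness that the $m + 1$-fold robustness generalizes can be checked on a toy example. This is a minimal sketch under labeled assumptions, not the paper's code: X is discrete with a hypothetical support so that expectations are exact sums, $H(b,p) = bpH_1 + bH_2 + pH_3 + H_4$ is used with only the conditional means of the $H_j$ given X specified, and these are chosen to satisfy $E[H_2|X] = -E[H_1|X]\,p(X)$ and $E[H_3|X] = -E[H_1|X]\,b(X)$ (the conditional form of the Lemma 3.3 identities used in the proofs). The mean of the first order estimator is then unbiased when either nuisance function is correct, and its bias is the product-form quantity $E[H_1(\hat B - B)(\hat P - P)]$ when both are wrong.

```python
support = [0.0, 1.0, 2.0, 3.0]      # hypothetical support of X
probs = [0.1, 0.2, 0.3, 0.4]        # hypothetical P(X = x)

def expect(f):
    """Exact expectation of f(X) under the discrete law above."""
    return sum(p * f(x) for x, p in zip(support, probs))

def h1(x): return -1.0 - 0.25 * x   # stands in for E[H_1 | X = x]
def b(x): return 1.0 + x            # hypothetical true b
def p(x): return 1.0 / (1.0 + x)    # hypothetical true p
def h2(x): return -h1(x) * p(x)     # E[H_2 | X] implied by the assumed identity
def h3(x): return -h1(x) * b(x)     # E[H_3 | X] implied by the assumed identity
def h4(x): return 0.3 * x           # arbitrary E[H_4 | X = x]

def mean_h(b_fn, p_fn):
    """E[H(b_fn, p_fn)] with H = b p H_1 + b H_2 + p H_3 + H_4."""
    return expect(lambda x: b_fn(x) * p_fn(x) * h1(x) + b_fn(x) * h2(x)
                  + p_fn(x) * h3(x) + h4(x))

def b_wrong(x): return 2.0          # misspecified estimate of b
def p_wrong(x): return 0.5          # misspecified estimate of p

psi = mean_h(b, p)
bias_p_wrong = mean_h(b, p_wrong) - psi        # b correct, p wrong
bias_b_wrong = mean_h(b_wrong, p) - psi        # p correct, b wrong
bias_both = mean_h(b_wrong, p_wrong) - psi     # both wrong
product = expect(lambda x: h1(x) * (b_wrong(x) - b(x)) * (p_wrong(x) - p(x)))
```

The higher order estimators extend this product structure with additional factors involving the errors of the estimators of $g$, which is why one correct factor suffices to remove the estimation bias.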

In settings where the dimension $d$ of $X$ is so large (e.g., 30-100) that the above
asymptotic results fail as a guide to the finite sample performance of our procedures
at the moderate sample sizes, say $n$ = 500-5000, commonly found in practice, one
might consider, as a practical data analysis strategy, using the $(m+1)$-fold robust
estimator $\hat\psi^{mod}_{m,k}$ with $\hat p(\cdot)$, $\hat b(\cdot)$, $\hat g(\cdot)$, and the $\hat g_s(\cdot)$ selected by cross-validation as
in van der Laan and Dudoit [23]. Specifically, the training sample is split into two
random subsamples: a candidate estimator subsample of size $n_c$ and a validation
subsample of size $n_v$, where both $n_c/n$ and $n_v/n$ are bounded away from 0 as $n \to \infty$.
A large number of candidate parametric models (e.g., a number growing polynomially in $n$)
of various dimensions and functional forms for $p$, $b$, and $g$ are fit to the candidate
estimator subsample, and the validation sample is used to find the candidate estimators
$\hat p(\cdot)$ and $\hat b(\cdot)$ for $p$ and $b$ and the $m-1$ candidate estimators $\hat g(\cdot)$ and $\hat g_s(\cdot)$,
$s = 3, \ldots, m$, for $g$ with the smallest estimated risks (with respect to an appropriate
risk function such as squared error or Kullback-Leibler).
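
The candidate/validation split described above can be sketched as follows. This is only an illustration under assumptions that are not in the text: a scalar covariate, polynomial regressions standing in for the "candidate parametric models", squared error as the risk function, and a hypothetical helper name `select_by_validation`.

```python
import numpy as np

def select_by_validation(x, y, degrees, frac_candidate=0.5, seed=0):
    """Fit each candidate model on the candidate subsample and return the
    candidate with the smallest validation squared-error risk."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_c = int(frac_candidate * len(x))       # size of candidate subsample
    cand, val = idx[:n_c], idx[n_c:]         # remaining points validate
    risks = {}
    for deg in degrees:
        # Candidate model: polynomial of the given degree (an assumption).
        coef = np.polyfit(x[cand], y[cand], deg)
        resid = y[val] - np.polyval(coef, x[val])
        risks[deg] = float(np.mean(resid ** 2))
    best = min(risks, key=risks.get)
    return best, risks
```

To obtain several distinct estimators of the same nuisance function, as the $(m+1)$-fold robust estimator requires for $g$, one could keep the few candidates with the smallest risks rather than only the minimizer.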

In the setting of very high dimensional $X$, current practice is to use a doubly
robust estimator, say $\hat\psi_1$, with $\hat p$ and $\hat b$ selected by cross validation (van der Laan
and Dudoit [23]). An $(m+1)$-robust estimator $\hat\psi^{mod}_{m,k}$ with $k \gg n$ and $m$ rather large
may be preferable to a doubly robust estimator for two reasons. First, if one uses
an $(m+1)$-robust estimator of $\psi$ rather than a DR estimator, it may be more likely
that the estimator will have very small bias, as it is more likely that at least one
of $m+1$, rather than one of two, models is very nearly correct. Second, because
$k \gg n$, nominal $1 - \alpha$ Wald confidence intervals centered at $\hat\psi^{mod}_{m,k}$ will be wider
than the interval of length proportional to $n^{-1/2}$ centered at $\hat\psi_1$. A wide interval is
a more appropriate measure of the actual uncertainty about $\psi$. However, it is also
the case that the bias of $\hat\psi^{mod}_{m,k}$ can exceed that of $\hat\psi_1$ when all of the $m+1$ models
selected by cross-validation are far from correct, owing to the product structure
of the estimation bias. The product of $m+1$ errors, each greater than 1, will
exceed the product of just 2 of the errors. We therefore plan to compare through
simulations the finite sample performances of $\hat\psi^{mod}_{m,k}$ and $\hat\psi_1$ in the setting of very
high dimensional $X$.

4. Rates of convergence and minimaxity

We consider a generic version in which we only assume a model and functional
satisfying Ai)-Aiv). To examine efficiency issues, we first consider the estimator
$\hat\psi_1$ based on the first order influence function and sample splitting. Without loss
of generality we assume $\beta_p \ge \beta_b$. (Otherwise simply interchange $\beta_p$ and $\beta_b$ in what
follows.) It will be useful to consider the alternative parametrization

(1-0 

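The (conditional) variance of $\hat\psi_1$ is of the order of $1/n$ and the (conditional) bias
of $\hat\psi_1$ in estimating $\psi$ is $O_p\left(n^{-\left(\frac{\beta_b}{d+2\beta_b} + \frac{\beta_p}{d+2\beta_p}\right)}\right)$. If $\Delta = 0$ and thus $\beta_p = \beta_b$, the
bias of $\hat\psi_1$ is $n^{-2\beta/(d+2\beta)}$ and $\hat\psi_1$ is not $n^{1/2}$-consistent for $\psi$ when $\beta/d < 1/2$. At the
other extreme, as $\Delta \to \infty$, i.e. $\beta_b \to 0$, the bias of $\hat\psi_1$ is $n^{-2\beta/(d+4\beta)}$, so that $\hat\psi_1$ fails to be
$n^{1/2}$-consistent for any finite $\beta$.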
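
The rate arithmetic in the preceding paragraph can be checked with a small sketch. The function names are hypothetical, and the exponent used is the first order bias exponent $\beta_b/(d+2\beta_b) + \beta_p/(d+2\beta_p)$ read off the text; the estimator is root-n consistent exactly when this exponent is at least 1/2.

```python
def first_order_bias_exponent(beta_b, beta_p, d):
    """Exponent e such that the bias of the first order estimator is n**(-e)."""
    return beta_b / (d + 2 * beta_b) + beta_p / (d + 2 * beta_p)

def is_root_n_consistent(beta_b, beta_p, d):
    """True when the bias n**(-e) is no larger than the n**(-1/2) standard error."""
    return first_order_bias_exponent(beta_b, beta_p, d) >= 0.5
```

With equal smoothness the threshold is $\beta/d = 1/2$; with $\beta_b = 0$ and $\beta_p = 2\beta$ the exponent is $2\beta/(d+4\beta)$, which stays below 1/2 for every finite $\beta$.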
Minimaxity with g known. To further examine efficiency issues, it is instructive
to first consider the estimation of $\psi$ with $g(\cdot)$ known. If $g(\cdot)$ were known, we could
set $\hat g(X) = g(X)$ when calculating $\hat\psi_{m,k}$. Then $EB_2 = 0$ and $\hat\psi_{2,k}$ would therefore
be an unbiased estimator of $\tilde\psi_k$. Letting a superscript $g$ denote the model with $g$
known, it is easy to see that the optimal estimator in our class would be $\hat\psi_{2,k^g_{opt}(2)}$, where $k^g_{opt}(2)$
satisfies $\max(1/n, k/n^2) \asymp var(\hat\psi_{2,k}) = TB_k^2 = k^{-2(\beta_b+\beta_p)/d} = k^{-4\beta/d}$. Solving
this, we find that when $\beta/d$ is greater than or equal to 1/4, we can take $k^g_{opt}(2) = n^{d/(4\beta)} \le n$
and $\hat\psi_{2,k^g_{opt}(2)} - \psi = O_p(n^{-1/2})$ regardless of $\Delta$, which is, of course,
the minimax rate.

In contrast if $\beta/d < 1/4$, $k^g_{opt}(2) = n^{2/(1+4\beta/d)}$ and $\hat\psi_{2,k^g_{opt}(2)} - \psi = O_p(n^{-4\beta/(4\beta+d)})$. In
an unpublished paper, we have proved that this is the minimax rate when $g(\cdot)$ is
known.
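
As a hedged illustration of the balance between variance $\max(1/n, k/n^2)$ and squared truncation bias $k^{-4\beta/d}$ with $g$ known, the sketch below returns the exponents of $n$ for the balancing $k$ and for the resulting rate. The function names are hypothetical; the case split at $\beta/d = 1/4$ follows the text.

```python
def k_opt_exponent_g_known(beta, d):
    """Exponent a with k_opt = n**a balancing variance and squared bias."""
    r = beta / d
    if r >= 0.25:
        # Variance is 1/n for k <= n; solve k**(-4r) = 1/n.
        return 1.0 / (4.0 * r)
    # Variance is k/n**2 for k >= n; solve k/n**2 = k**(-4r).
    return 2.0 / (1.0 + 4.0 * r)

def rate_exponent_g_known(beta, d):
    """Exponent e with estimation error of order n**(-e)."""
    r = beta / d
    return 0.5 if r >= 0.25 else 4.0 * r / (1.0 + 4.0 * r)
```

For example, with $\beta/d = 1/8$ the balancing choice is $k = n^{4/3}$, which exceeds $n$, and the rate is $n^{-1/3}$.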

This raises the question of whether the lower bounds of rate $n^{-1/2}$ for $\beta/d \ge 1/4$
and/or rate $n^{-4\beta/(4\beta+d)}$ for $\beta/d < 1/4$ are still achievable when $g$ is unknown, without
restrictions on the smoothness of $g$.

Before addressing this question, we take the opportunity to compare the relative 
efficiencies of competing rate-optimal unbiased estimators in the case of g known. 
This discussion will provide further insight into the results given in Remark 2.6 for 
models which are not locally nonparametric. 



Relative efficiency of various unbiased estimators with g known. For
simplicity, we restrict the following discussion to the truncated version of the
parameter $\psi = E[\{b(X)\}^2]$, with $b(X) = E[Y|X]$, $g(\cdot)$ known, and $Y$ Bernoulli. For
this choice of $\psi$, $g(\cdot)$ is the marginal density of $X$. In this subsection, we assume
g{X) is chosen equal to the known g {X) so E ZkZ^ 

S = -P = 1 and take B ^b{X) e lin{Zk}, so B 



= 1 1 



kxk- 



n BiZk 



E 



Also wc choose 

■Tl 



BZ, 



Zk and 



E 



{^[B\Zk]f 



do not depend on B. Further we only concern ourselves
with efficiency relative to the $n$ observations in the estimation sample. We thus
ignore any efficiency loss from using $N - n$ observations to construct $\hat b$.

Let $\Theta^g = \{b : x \mapsto b(x) \in [0,1]\} \subset \Theta$ denote the subset of $\Theta$ corresponding to
the known $g$, which consists of all functions from the unit cube in $R^d$ to the unit
interval. The model $M(\Theta^g)$ is not locally nonparametric. For example, the first
order tangent space $\Gamma_1(b)$ does not include first order scores for $g$. Its second order
tangent space $\Gamma_2(b)$ does not contain second order scores for $g$ or mixed scores for
$g$ and $b$. Rather, $\Gamma_2(b)$ is the closed linear span of the first and second order scores
for $b$. Thus

$\Gamma_2(b) = \{\mathbb{S}(a,c) ;\ var_b[\mathbb{S}(a,c)] < \infty;\ a \in A,\ c \in C\}$

where

$\mathbb{S}(a,c) \equiv \{(Y - B)\, a(X)\}_i + (Y - B)_i\, c(X_i, X_j)\, (Y - B)_j,$

and $A$ and $C$ are the sets of one and two dimensional functions of $x$. Since, for
$\hat b \in lin\{z_k(x)\}$,



'4'2.k {h^ 



'2,k 



Y - B] Z 



2B{Y 

■T 



B 



Y - B 



is unbiased for $\tilde\psi_k(b) = E\{\Pi[B|Z_k]^2\}$ in model $M(\Theta^g)$, we know, by Remark
2.6, that $\mathbb{IF}^{eff}_{2,\tilde\psi_k}(b)$ for $\tilde\psi_k(b)$ is the projection $\Pi_b[\hat\psi_{2,k} - \tilde\psi_k(b) \,|\, \Gamma_2(b)]$ of the second
order influence function $\hat\psi_{2,k} - \tilde\psi_k(b)$ onto $\Gamma_2(b)$. Now if $\hat\psi_{2,k} - \tilde\psi_k(b)$ was an element
of $\Gamma_2(b)$, $\hat\psi_{2,k} - \tilde\psi_k(b)$ would equal $\mathbb{IF}^{eff}_{2,\tilde\psi_k}(b)$ and thus be second order 'unbiased
locally efficient', at $b \in \Theta^g$, as defined earlier in Remark 2.6. However we show
below that, unless $b(X) = c$ w.p.1 for some constant $c$, $\hat\psi_{2,k} - \tilde\psi_k(b)$ is not an
element of $\Gamma_2(b)$ for any $b$. Rather, a straightforward calculation gives



IF'^^i (6) = V ■ 

2,ipk 



2E 

(Y- 



' — T' 

B)z': 



Zk {Y - S)J ^ 
[Zk {Y~B)], 



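Now one can check that $\tilde\psi_k(\hat b) + \mathbb{IF}^{eff}_{2,\tilde\psi_k}(\hat b)$ is a function of $\hat b$, so by Theorem 2.7
of Remark 2.6, we conclude no unbiased globally efficient estimator exists. However,
we prove below that $\tilde\psi_k(\hat b) + \mathbb{IF}^{eff}_{2,\tilde\psi_k}(\hat b)$ and $\hat\psi_{2,k}$ have identical means. It follows
that $\tilde\psi_k(\hat b) + \mathbb{IF}^{eff}_{2,\tilde\psi_k}(\hat b)$ is an unbiased estimator of $\tilde\psi_k(b) = E\{\Pi[B|Z_k]^2\}$ for
any $\hat b \in lin\{z_k(x)\}$. Thus, for a given choice of $\hat b \in lin\{z_k(x)\}$, $\tilde\psi_k(\hat b) + \mathbb{IF}^{eff}_{2,\tilde\psi_k}(\hat b)$
is second order unbiased locally efficient at $b = \hat b$. However, one can show using a
proof analogous to that in Theorem 3.22 that for $k \gg n$,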



2.^k 



for 
b 



varj, 
= 1 



2:Wk 



/varh 



IF'^i ib) 

2,ipk 



Op 



b-b 



Henceforth we assume that $b$ lies in a Hölder ball $H(\beta_b, C_b)$. That is, we consider
the submodel $b \in \Theta^g \cap H(\beta_b, C_b)$ and assume $\hat b(x) \in lin\{z_k(x)\}$ converges
to $b$ in sup norm at the optimal rate of $(\log n / n)^{\beta_b/(2\beta_b+d)}$ uniformly over
$\Theta^g \cap H(\beta_b, C_b)$. The submodel and the model have identical tangent spaces. For all
$b \in \Theta^g \cap H(\beta_b, C_b)$, $\{\max(n^{-1}, k/n^2)\}^{-1/2}\{\tilde\psi_k(\hat b) + \mathbb{IF}^{eff}_{2,\tilde\psi_k}(\hat b) - \tilde\psi_k(b)\}$ has an
asymptotic distribution with mean zero and variance equal to

$\lim \{\max(n^{-1}, k/n^2)\}^{-1}\, var_b[\mathbb{IF}^{eff}_{2,\tilde\psi_k}(b)]$

for all $b \in \Theta^g \cap H(\beta_b, C_b)$. In a slight abuse of language, we shall refer to
$var_b[\mathbb{IF}^{eff}_{2,\tilde\psi_k}(b)]$ as the asymptotic variance of $\{\tilde\psi_k(\hat b) + \mathbb{IF}^{eff}_{2,\tilde\psi_k}(\hat b) - \tilde\psi_k(b)\}$.
Thus, as with standard first order theory, even when no unbiased estimator has finite
sample variance that attains the Bhattacharyya bound for all $b \in \Theta^g \cap H(\beta_b, C_b)$,
there can exist an unbiased estimator sequence whose asymptotic variance does
attain the bound globally.



We next compare the means and variances of ipk 
the two estimators are algebraically related by 



b ) and '02, fc- Now 





+ I¥^^£ (b]\ 


+ {v 




- E 


B^' 




2..ipk V / J 











Since 

2,Vfc 



B^ 







E 



B^ 



is unbiased for zero, wc conclude that ip2,k and 



have the same mean but var^ 



/varfc 

b {X) = b (X) ~ c w.p.l for some c. Thus, since ipk (^bj 



W^l (b) > 1 except when 



2,iAfc 



has asymptotic 



variance varj, 



IF' 



2,iAfc 



(6) and, except when b (X) = c+Op (1), var ( V 



B^ 



,-1 



we conclude the asymptotic variance of 'ip2,k attains the bound vart, 



W'^l (b) 



when k ^ n, but exceeds the bound when k < n, except when b (X) = c + Op (1). 
Finally, for completeness, Robins and van der Vaart [] !)] considered an alternative 

particularly simple rate-optimal unbiased estimator of ipk (b) = E {ll [_B|Zfc]} 



given by ipRv 
is 



v| YZ^ [Zfey]^. |. The Hocffding decomposition of T/'ji'y — 'i/'fe (6) 



V 



E 



— T 



ZkY ~ Vfe (b) 



— T 



■ — T 

BZk 



\ZuY-E[BZk\]] 



^W^^L {b) + Q + T 



ith 



Q=Y[{U[B\Zk]B-i'}] 

T = W 



2 iB,Zi,.iZkj 



Il[B\Zk]^){Y-B)^ 
B,zl,Zk,jBj - n [B\Zk] , S, - n [B\Zk]^ Bj 



Since, except when $B = c$ w.p.1, $var_b(Q) \asymp 1/n$ and $var_b(T) \asymp k/n^2$, we conclude
that the asymptotic variance of $\hat\psi_{RV}$ exceeds the bound $var_b[\mathbb{IF}^{eff}_{2,\tilde\psi_k}(b)]$ regardless
of whether $k \gg n$ does or does not hold, except when $b(X) = c$ w.p.1.

Minimaxity with unknown g and $\beta/d \ge 1/4$. We now show that the bound $n^{-1/2}$
for $\beta/d \ge 1/4$ is achievable for each $\beta_g > 0$. Consider the estimator $\hat\psi_{m,k}$ with
$n^{d/(4\beta)} \le k \le n$ and

$m \ge 1 + \left\{\frac{1}{2} - \frac{\beta_b}{d+2\beta_b} - \frac{\beta_p}{d+2\beta_p}\right\}\frac{2\beta_g+d}{\beta_g}$

so that $EB_m = O_p\left(n^{-\left(\frac{(m-1)\beta_g}{2\beta_g+d} + \frac{\beta_b}{d+2\beta_b} + \frac{\beta_p}{d+2\beta_p}\right)}\right)$ is $O_p(n^{-1/2})$. Then $var(\hat\psi_{m,k}) \asymp 1/n$,
$TB_k^2 = O_p(1/n)$ and $EB_m^2 = O_p(1/n)$, so $\hat\psi_{m,k}$ will be $n^{1/2}$-consistent for $\psi$.

If $\Delta = 0$ and $\beta < d/2$, the above expression implies that
$m \ge \frac{d-2\beta}{2(2\beta+d)}\,\frac{2\beta_g+d}{\beta_g} + 1$
for $n^{1/2}$-consistency. Similarly, if $\Delta \to \infty$, i.e. $\beta_b \to 0$, it is necessary that
$m \ge \frac{d}{2(4\beta+d)}\,\frac{2\beta_g+d}{\beta_g} + 1$ for $n^{1/2}$-consistency. These results imply that estimators $\hat\psi_{m,k}$
in our class can always achieve $n^{1/2}$-consistency whenever $\beta_g > 0$, but for fixed
$\beta < d/2$, the order $m$ of the required U-statistic increases without bound as the
smoothness $\beta_g$ of $g$ approaches zero.
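
The growth of the required order $m$ as $\beta_g$ shrinks can be made concrete with a short sketch. It assumes the lower bound has the form $m \ge 1 + \{1/2 - \beta_b/(d+2\beta_b) - \beta_p/(d+2\beta_p)\}(2\beta_g+d)/\beta_g$, which is a reading of the damaged display above; the function name and the small rounding tolerance are illustrative choices.

```python
import math

def min_order(beta_b, beta_p, beta_g, d):
    """Smallest integer m meeting the assumed lower bound for root-n consistency."""
    gap = 0.5 - beta_b / (d + 2 * beta_b) - beta_p / (d + 2 * beta_p)
    if gap <= 0:
        # The first order bias is already O(n**-1/2); no higher order needed.
        return 1
    bound = 1 + gap * (2 * beta_g + d) / beta_g
    # Subtract a tiny tolerance so exact integers are not pushed up by rounding.
    return math.ceil(bound - 1e-9)
```

For instance, with $\beta_b = \beta_p = 1$ and $d = 4$ the sketch gives order 3 at $\beta_g = 0.5$ and order 15 at $\beta_g = 0.05$.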



Efficiency. We now show that when $\beta/d$ is strictly greater than 1/4, we can
construct an unconditional asymptotically linear estimator based on all $N$ subjects with
influence function $N^{-1}\sum_{i=1}^{N} IF_{1,\psi,i}(\theta)$ by having the number of the $N$ subjects
allotted to the validation sample and analysis sample be $N^{1-\epsilon}$ and $n = n(\epsilon) = N - N^{1-\epsilon}$,
respectively, for $1 > \epsilon > 0$. It then follows from van der Vaart [26] that the
estimator is regular and semiparametric efficient. Specifically, suppose $\beta/d = 1/4 + \delta$,

6 > 0. Consider the estimator i/'m*,fc with m* > 1 + | 2(T37) ~ 1+21% ~ d+2i3p } ^^7^ 



so that EBm* ^ Op\N 



7i(e)T+^ < 71 (e) so that TB| = Op{l/N) and var 
2. Then, by our previous results. 




is Op {N^^/"^) and k = 



= Op (l/N) for J > 



m* .k 



It remains to show that 

N 



1=1 



4=1 



But the LHS is 



"(e) r /' ^ 1 ^ 

(e)-5:/F,,,(^)|l^-lU^- 

i=l ^ ■' i=n(e) + l 



p{n{e)-'/'N-^)+Op[N d-V^AT-i) 



= Op (iv-i/^iv-^) + Op (iv-i/2iv-^/2) 

Adaptivity when f3/d > 1/4. We next prove that if we let n = n (e) ~ N — 

N^~\ m = m{N)= o{N) with In (iV) = 0{m{N)) and k = n(e)/ln(n), ^,„,fc 
will be semiparametric efficient for each /3 > 1/4, provided ||5 (■) ~ .9 (OIloo ~ 



Up 

EB, 



Clearly, the truncation bias is o (N ^/^). The estimation bias 



m{N) 



is Op (^rr.(iV(i-))-^[™W"^l7V -(^-^H A+^iy 



Thus EB„ 



Op (iV-1/2) if ^ (iv(i-^)) 



-2[m(Af)-l] 



0b I 



So we re- 



oo. 



quire 2[m (iV) - 1] In {m (iVd-)) } / [i - (1 - {-^ + -^}] In (iV) 
which is satisfied if In (A^) = 0{m{N)). In the appendix of our technical report 
we prove that var^
Op (m {N^^-'^y^ 



V'm.fe 

Now 



var-- 



m.k 



{1 + Op (1)} provided \ \g{-) - g (• 



var' 



But 



/=0 



1 — [mn) ^ ^ ' ' 
1 - {hm}"^ 



m{N) 

o I E 



= (l + {lnn}" 



so vare l^-^m^fc j is n ivar^|/Fi^,/,^j (^^^ | {1 + Op (1)} = n Var {IFi,,^} {1 + Op (1)}. 
The proof of efficiency now proceeds as above. 

Alternative estimators when $\beta/d > 1/4$. When $\beta/d > 1/4$, there actually
exist, at least for certain functionals in our class, $n^{1/2}$-consistent estimators of $\psi$
that are much simpler than our very high order U-statistic estimators. For example
consider the expected conditional covariance $\psi = E[cov\{A, Y|X\}]$ of Example 1b
with $d = 1$.

Example 1b (Continued). Number the study subjects $i = 0, \ldots, N-1$ ordered
by their realized values $X_i$, where we have not split the sample. Following Wang et
al. [27], consider the difference-based estimator

$\hat\psi_d = N^{-1} \sum_{i=0}^{N/2-1} \{Y_{2i}A_{2i} + Y_{2i+1}A_{2i+1} - Y_{2i+1}A_{2i} - Y_{2i}A_{2i+1}\},$

which has conditional mean given $\{X_1, \ldots, X_N\}$ of

$N^{-1} \sum_{i=0}^{N/2-1} \left[cov\{A,Y|X_{2i}\} + cov\{A,Y|X_{2i+1}\}\right]$
$\quad + N^{-1} \sum_{i=0}^{N/2-1} \{b(X_{2i+1}) - b(X_{2i})\}\{p(X_{2i+1}) - p(X_{2i})\}.$

Hence

$E[\hat\psi_d - \psi] = N^{-1} E\left[\sum_{i=0}^{N/2-1} \{b(X_{2i+1}) - b(X_{2i})\}\{p(X_{2i+1}) - p(X_{2i})\}\right]$
$\quad = O\left(N^{-1}\sum_{i=0}^{N/2-1} E\{|X_{2i+1} - X_{2i}|^{2\beta}\}\right) = O(N^{-2\beta})$

by the theory of spacings (Pyke [13]). But $O(N^{-2\beta})$ is $o(N^{-1/2})$ when $\beta > 1/4$.

The variance of $\hat\psi_d$ is $O(N^{-1})$ so $\hat\psi_d$ is $N^{1/2}$-consistent. However, the ratio
$var_\theta(\hat\psi_d)/var_\theta\{\mathbb{IF}_{1,\psi}(\theta)\}$ is not $1 + o_p(1)$, so $\hat\psi_d$ is not (semiparametric) efficient. As discussed by Arellano [1], by
using an $m$-th order rather than a second order difference operator and letting $m \to \infty$
at an appropriate rate as $N \to \infty$, the $m$-th order estimator $\hat\psi_d$ can be made efficient.
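
A minimal sketch of the second order difference-based estimator of Example 1b follows. It assumes a scalar covariate and an even sample size, pairs adjacent subjects after sorting on $X$, and uses the identity $Y_{2i}A_{2i} + Y_{2i+1}A_{2i+1} - Y_{2i+1}A_{2i} - Y_{2i}A_{2i+1} = (Y_{2i} - Y_{2i+1})(A_{2i} - A_{2i+1})$. The function name is hypothetical.

```python
import numpy as np

def difference_estimator(x, a, y):
    """Difference-based estimate of E[cov(A, Y | X)] for scalar X."""
    order = np.argsort(x)                  # order subjects by realized X
    a, y = a[order], y[order]
    n_pairs = len(x) // 2                  # adjacent pairs (2i, 2i+1)
    a0, a1 = a[0:2 * n_pairs:2], a[1:2 * n_pairs:2]
    y0, y1 = y[0:2 * n_pairs:2], y[1:2 * n_pairs:2]
    # Each pair contributes (Y_2i - Y_2i+1)(A_2i - A_2i+1); divide by N.
    return float(np.sum((y0 - y1) * (a0 - a1)) / len(x))
```

Because each pair is used once, the estimator is simple but not efficient, in line with the remark above about higher order difference operators.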
Minimaxity with unknown g and $\beta/d < 1/4$. Consider next whether the
lower bound of $n^{-4\beta/(4\beta+d)}$ for $\beta/d < 1/4$ is achievable when $g$ is unknown but $\beta_g > 0$.
We will show in the next section that the bound $n^{-4\beta/(4\beta+d)}$ is achievable provided



2(igld+l (A + 2) 

i.e., (}g > (A+2) (i+Wdt-4y/3/rf)(i-4/?/rf)( A+1). To attain the bound $n^{-4\beta/(4\beta+d)}$
whenever Equation (4.1) holds, we introduce new more efficient estimators, owing to the
fact that an estimator $\hat\psi_{m,k}$ in our class can attain the bound $n^{-4\beta/(4\beta+d)}$ only in the
special case where the second order estimation bias
$EB_2 = O_p\left(n^{-\left(\frac{\beta_g}{2\beta_g+d} + \frac{\beta_b}{d+2\beta_b} + \frac{\beta_p}{d+2\beta_p}\right)}\right)$ is less than $n^{-4\beta/(4\beta+d)}$.



For a fixed $\beta = (\beta_p + \beta_b)/2$, the right hand side of Equation (4.1) is minimized
over $\Delta \ge 0$ at $\Delta = 0$. At $\Delta = 0$, Equation (4.1) reduces to

(4.2) $\frac{\beta_g/d}{2\beta_g/d + 1} > \frac{(1 - 4\beta/d)\,\beta/d}{1 + 4\beta/d}$

or equivalently

(4.3) $\beta_g > \frac{\beta(1 - 4\beta/d)}{1 + 2\beta/d + 8(\beta/d)^2}$.

The right hand side of Equation (4.1) increases with $\Delta$ with asymptote equal to
twice the RHS of Equation (4.2) as $\Delta \to \infty$. Hence, in order to attain the optimal
rate $n^{-4\beta/(4\beta+d)}$ when $\beta_p = 2\beta$ and $\beta_b = 0$, the quantity $\frac{\beta_g}{2\beta_g+d}$ must be twice as large
as when $\beta_p = \beta_b = \beta$.
In the next section, we construct an estimator with a convergence rate of 

log (n) n~4/3+d at the cut-point i^2pg ~ ^^i+4p^ ■ ^'^^^ paper we do not consider 
the construction of estimators that arc rate optimal below this cutpoint. 

However, for the special case A = 0, in an unpublished paper Li et. al. [9] 
have constructed estimators which converge at a rate given in Eq. (1.3), whenever 
inequality (4.1) fails to hold . We conjecture that this rate is minimax, possibly 
only up to log factors, when inequality (4.1) fails to hold and A = 0. At the 
cut-point i_^2pg " *'\~+4g'^ ' obtain m* ~ and thus Equation (1.3) becomes 

4, a 

log (n) n 4/3+d ^ in agreement with the rate of the estimator of Section 4.1.2 below. In 
the extreme case in which /3t, ^ with /3 remaining fixed, log (n) n ^ i+2fjg/d 2/3/d 

\og{n)n~^~^ ^+'^'^!>^^^^~''''^^''^ = log (n) n^^'^/'', which agrees (up to a log 

factor) with the rate of n~'^^/'^ given by the simple estimator of Wang et al. [27] 
analyzed above under "Example lb continued." Based on the arguments given 
in the Appendix of our technical report, we conjecture that when (3 < d/A and 



r (n) 11^— gll^ = Op (1) for some r (n) 00 as 71 



{n '^^ / (\ognf^ for some natural number t and any m = m{n) > 

4if3/df login) , h-k(n)- 71 "'m^T-/' 

[1+20/ d) iog(r(n)) <J,nd K - ft (^nj - n I ) 

Remark 4.1. Suppose $b(\cdot) = E[Y|X = \cdot]$ is known to be contained in a Hölder ball
$H(\beta, C)$. We provide a heuristic argument as to why the minimax rate for the linear
functional $b(x)$ does not depend on a priori knowledge of the smoothness of $f_X(x)$
but the minimax rate for the functional $\psi = E[b^2(X)]$ may when $\beta/d < 1/4$. First
consider the case where $f_X(x)$ is known. Let $\{\phi_l(x), l = 1, 2, \ldots\}$ be a complete
linearly independent basis for $L_2(F_X)$. Define



T 



Efx 



h{X)cp^ {XY 



Epx 



Then bk [x) = X]/=i n \b{x) (x)] — "qj^cj)}. (x) is the projection in L2 {Fx) of 
b {x) onto hncar span of the first k basis functions. With fx {x) known and ry[ = 



E 



Fx 



(t)f^{X) (l)j^{Xy , unbiased estimators '}2d=i^4'k{x) and 



r]l ^E (j)/. {X) (j)f. {X) 7]2,k of the the truncated functionals bk {x) = rjl (pf. {x) 



and ipk = E 



bUX) 



E 



for b [x) and ■(/;, when k 



vI<f>k{X)<f>k {X)'vk 



are, respectively, rate minimax 



^opt 



is chosen to equate the order of the respective
variances $k/n$ and $\max(1/n, k/n^2)$ with the order of the respective squared
truncation biases $|b(x) - \tilde b_k(x)|^2 = k^{-2\beta/d}$ and $(\tilde\psi_k - \psi)^2 = k^{-4\beta/d}$. For $b(x)$,
$k_{opt} = n^{1/(1+2\beta/d)} \ll n$ and the rate is $n^{-\beta/(d+2\beta)}$. For $\psi$, $k_{opt} = n^{2/(1+4\beta/d)} \gg n$
and the rate is $n^{-(4\beta/d)/(1+4\beta/d)}$ when $\beta/d < 1/4$. The minimax rate for $b(x)$ with
$f_X(x)$ unknown and without smoothness assumptions imposed is the same as with
$f_X(x)$ known, since, subject to some regularity conditions, for $k_{opt} \le n$,



P. 



Yd^^^^^ixy p„ <j>,^^^{x)<j>,^^^{xy <i>k.^A^)-vl,Ak.,A^) 



remains of order $O_p(k_{opt}/n)$. In contrast, with $k > n$, $P_n[\bar\phi_k(X)\bar\phi_k(X)^T]$ is not
invertible. It is for this reason that the minimax rate for $\psi$ with $f_X(x)$ unknown is
slower than $n^{-4\beta/(4\beta+d)}$ unless the model places sufficient restrictions on the
complexity of $f_X(x)$.
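
The non-invertibility of the empirical second moment matrix of $k$ basis functions when $k > n$ can be seen numerically. The sketch below is illustrative only: it assumes a scalar covariate on $[0,1]$ and a cosine basis, neither of which is specified in the text, and the function name is hypothetical.

```python
import numpy as np

def empirical_gram(x, k):
    """Empirical k x k matrix P_n[phi phi^T] for the first k cosine basis functions."""
    j = np.arange(k)
    phi = np.cos(np.pi * np.outer(x, j))   # n x k design; column 0 is constant
    return phi.T @ phi / len(x)

rng = np.random.default_rng(0)
gram_small_k = empirical_gram(rng.uniform(size=200), 10)   # k < n
gram_large_k = empirical_gram(rng.uniform(size=20), 30)    # k > n
rank_small_k = np.linalg.matrix_rank(gram_small_k, tol=1e-8)
rank_large_k = np.linalg.matrix_rank(gram_large_k, tol=1e-8)
```

The matrix is a sum of $n$ rank-one terms, so its rank cannot exceed $n$; with $k = 30$ and $n = 20$ it is therefore singular, whereas with $k = 10$ and $n = 200$ it has full rank.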

Improved rates of convergence with X random in a semiparametric
model. We now, as promised in the Introduction, construct an estimator of $\sigma^2$
under the homoscedastic model $E[Y|X] = b(X)$, $var[Y|X] = \sigma^2$ with $X$ random
with unknown density that, whenever $\beta < \min\{1, d/4\}$ and, regardless of the
smoothness of $f_X(x)$, converges at the rate $n^{-(4\beta/d)/(4\beta/d+1)}$, which is faster than the
equal-spaced non-random minimax rate of $n^{-2\beta/d}$. Specifically we divide the support of
$X$, i.e. the unit cube in $R^d$, into $k = k(n) = n^{\gamma}$, $\gamma > 1$, identical subcubes with edge
length $k^{-1/d}$. We continue to assume the unknown density $f_X(x)$ is absolutely
continuous with respect to Lebesgue measure and both it and its inverse are bounded
in sup-norm. Then it is a standard probability calculation that the number of
subcubes containing at least two observations is $O_p(n^2/k)$. We estimate $\sigma^2$ in each
such subcube by $(Y_i - Y_j)^2/2$, where, for any subcube with 3 or more observations,
$i$ and $j$ are chosen randomly, without replacement. Our final estimator $\hat\sigma^2$ of $\sigma^2$
is the average of our subcube-specific estimates $(Y_i - Y_j)^2/2$ over the $O_p(n^2/k)$
subcubes with at least two observations. The rate of convergence of the estimator
is minimized at $n^{-(4\beta/d)/(4\beta/d+1)}$ by taking $k = n^{2/(1+4\beta/d)}$, as we now show.

We note that

$E[(Y_i - Y_j)^2/2 \,|\, X_i, X_j] = \sigma^2 + \{b(X_i) - b(X_j)\}^2/2,$

$|b(X_i) - b(X_j)| = O(\|X_i - X_j\|^{\beta})$ by $\beta < 1$, and $\|X_i - X_j\| = d^{1/2}\, O(k^{-1/d})$
when $X_i$ and $X_j$ are in the same subcube. It follows that the estimator has
variance $O_p(k/n^2)$ and bias of $O(k^{-2\beta/d})$. To minimize the convergence rate we
equate the orders of the variance and the squared bias by solving $k/n^2 = k^{-4\beta/d}$,
which gives $k = n^{2/(1+4\beta/d)}$. Our random design estimator has better bias control
and hence converges faster than the optimal equal-spaced fixed $X$ estimator,
because the random design estimator exploits the $O_p(n^2/n^{2/(1+4\beta/d)})$ random
fluctuations for which $X$'s corresponding to two different observations are a distance
of $O(\{n^{2/(1+4\beta/d)}\}^{-1/d})$ apart. Our estimator will not converge at rate $n^{-(4\beta/d)/(4\beta/d+1)}$
to $E[var(Y|X)]$ in our nonparametric model, because it then no longer suffices
to average estimates of $var(Y|X)$ only over subcubes containing 2 or more
observations. Indeed, when $var(Y|X)$ depends on $X$, the estimator $\hat\sigma^2 = \hat\sigma^2_{k(n)}$ satisfies

$\{\hat\sigma^2 - E[f_{k(n)}(X)\, var\{Y|X\}] / E[f_{k(n)}(X)]\} = O_p(n^{-(4\beta/d)/(4\beta/d+1)})$, $k(n) = n^{2/(1+4\beta/d)}$,

where $f_{k(n)}(X)$ is $1/k(n)$ times the integral of $f_X(x)$ with respect to Lebesgue measure
over the subcube containing $X$.
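
The subcube construction just described can be sketched as below. This is a hedged illustration rather than a definitive implementation: the covariates are assumed to lie in the unit cube, the number of subcubes is passed as a per-axis count `cells_per_axis` (so the total is `cells_per_axis ** d`), and the function name is hypothetical.

```python
import numpy as np

def subcube_variance_estimator(x, y, cells_per_axis, seed=0):
    """Average of (Y_i - Y_j)^2 / 2 over subcubes holding at least two points.

    x has shape (n, d) with entries in [0, 1]; returns (estimate, cubes_used).
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    # Integer coordinates of the subcube containing each observation.
    cells = np.minimum(np.floor(x * cells_per_axis).astype(int), cells_per_axis - 1)
    ids = np.ravel_multi_index(tuple(cells.T), (cells_per_axis,) * d)
    groups = {}
    for i, c in enumerate(ids):
        groups.setdefault(int(c), []).append(i)
    estimates = []
    for members in groups.values():
        if len(members) >= 2:
            # With 3 or more points, pick two at random without replacement.
            i, j = rng.choice(members, size=2, replace=False)
            estimates.append((y[i] - y[j]) ** 2 / 2)
    return float(np.mean(estimates)), len(estimates)
```

In the text the total number of subcubes would be taken as $k = n^{2/(1+4\beta/d)}$; the sketch leaves that choice to the caller.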

Remark 4.2. Consider again Example 1c with $\tau(\theta)$ being the variance weighted
average treatment effect. We impose no smoothness assumptions on $f_X(x)$. The
argument in the previous two paragraphs implies that if $\beta = (\beta_p + \beta_b)/2 < d/4$ and
$\max(\beta_p, \beta_b) < 1$, we can construct an estimator $\hat\tau$ of $\tau(\theta)$ that converges at rate
$n^{-(4\beta/d)/(4\beta/d+1)}$ when the semiparametric model (3.5) holds, which is faster than our
conjectured minimax rate of $n^{-2\beta/d}$ for $\tau(\theta)$ when (3.5) is not assumed. Specifically,
again create $k = n^{2/(1+4\beta/d)}$ subcubes. Let $\hat\tau$ make the sum $\hat\psi(\tau)$ over subcubes
containing at least 2 observations of $\{Y_i^*(\tau) - Y_j^*(\tau)\}\{A_i - A_j\}/2$ equal to 0 (treating
subcubes with greater than 2 observations as above), where $Y^*(\tau) = Y - \tau A$.
When (3.5) holds, $cov\{Y^*(\tau(\theta)), A|X\} = 0$. Thus, an argument analogous to
that above implies that $\hat\psi(\tau(\theta))$ converges to $cov\{Y^*(\tau(\theta)), A|X\} = 0$ at rate
$n^{-(4\beta/d)/(4\beta/d+1)}$. That $\hat\tau$ converges to $\tau(\theta)$ at rate $n^{-(4\beta/d)/(4\beta/d+1)}$ is then proved by a Taylor
expansion of $0 = \hat\psi(\hat\tau)$ around $\tau(\theta)$.
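
Since $Y^*(\tau) = Y - \tau A$ is linear in $\tau$, the estimating equation of Remark 4.2 has a closed form root, namely the sum of $(Y_i - Y_j)(A_i - A_j)$ over the selected within-subcube pairs divided by the sum of $(A_i - A_j)^2$. The sketch below assumes a scalar covariate in $[0,1]$ and a hypothetical function name; the data-generating model used to exercise it (a partially linear model) is likewise an assumption standing in for model (3.5).

```python
import numpy as np

def tau_hat(x, a, y, k, seed=0):
    """Root of the pairwise estimating equation over k equal-width cells of [0, 1]."""
    rng = np.random.default_rng(seed)
    cells = np.minimum(np.floor(x * k).astype(int), k - 1)
    groups = {}
    for i, c in enumerate(cells):
        groups.setdefault(int(c), []).append(i)
    num = 0.0
    den = 0.0
    for members in groups.values():
        if len(members) >= 2:
            # One random pair per cell, drawn without replacement.
            i, j = rng.choice(members, size=2, replace=False)
            num += (y[i] - y[j]) * (a[i] - a[j])
            den += (a[i] - a[j]) ** 2
    # Solves sum {(Y_i - Y_j) - tau (A_i - A_j)} (A_i - A_j) / 2 = 0 for tau.
    return num / den
```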

4.1. More efficient estimators

Case 1: The estimation bias of the third order estimator is less than the
optimal rate

In a (locally) nonparametric model $M(\Theta)$, the estimator $\hat\psi_{m,k} = \tilde\psi_k(\hat\theta) + \mathbb{IF}_{m,\tilde\psi_k}(\hat\theta)$ is
essentially the unique $m$-th order U-statistic estimator of the truncated parameter
$\tilde\psi_k$ for which the leading term in the bias is $O(\|\hat\theta - \theta\|^{m+1})$. However, when the
minimax rate of convergence for $\psi$ is slower than $n^{-1/2}$, other $m$-th order U-statistic
estimators will often converge to $\tilde\psi_k$ (and thus $\psi$) at a faster rate uniformly over the
model than does any estimator $\hat\psi_{m,k}$ (constructed from an estimated higher order
influence function $\mathbb{IF}_{m,\tilde\psi_k}(\hat\theta)$ for $\tilde\psi_k$) by tolerating bias at orders less than $m+1$ in
exchange for a savings in variance.

Remark 4.3. A heuristic understanding as to why this is so can be gained from the
following considerations. The theory of higher order influence functions as developed
in Theorems 2.2 and 2.3 is a theory of score functions (derivatives). Thus it can
directly incorporate the restriction that a function, say $b(x)$, has an expansion
$b(x) = \sum_{l=1}^{\infty} \eta_l \phi_l(x)$ for which $\eta_l = 0$ for $l > k$, as the restriction is equivalent to
various scores being equal to zero. However the theory cannot directly incorporate
restrictions such as $\sum_{l=k+1}^{\infty} \eta_l^2 = k^{-2\beta/d}$ or $\eta_l \propto l^{-(\beta/d+1/2)}$ that do not imply any
restrictions on score functions. Thus to find an optimal estimator, one must perform
additional "side calculations" to quantify the estimation and truncation bias of
various candidate estimators under these restrictions. As the assumption that $b(x)$
lies in a Hölder ball can be expressed in terms of such restrictions, this remark is
relevant to a search for an optimal rate estimator.

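We now construct more efficient estimators. We first consider the case where
$\beta_b$, $\beta_p$, and $\beta_g$ are such that the estimation bias $O\left(n^{-\left(\frac{\beta_g}{2\beta_g+d} + \frac{\beta_b}{d+2\beta_b} + \frac{\beta_p}{d+2\beta_p}\right)}\right)$ of
the second order estimator is greater than $O(n^{-4\beta/(4\beta+d)})$, but the estimation bias
$O\left(n^{-\left(\frac{2\beta_g}{2\beta_g+d} + \frac{\beta_b}{d+2\beta_b} + \frac{\beta_p}{d+2\beta_p}\right)}\right)$ of the third order estimator is less than $O(n^{-4\beta/(4\beta+d)})$.
That is,

(4.4) $n^{-\left(\frac{2\beta_g}{2\beta_g+d} + \frac{\beta_b}{d+2\beta_b} + \frac{\beta_p}{d+2\beta_p}\right)} < n^{-4\beta/(4\beta+d)} < n^{-\left(\frac{\beta_g}{2\beta_g+d} + \frac{\beta_b}{d+2\beta_b} + \frac{\beta_p}{d+2\beta_p}\right)}$.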



Then the most efficient estimator V'm,/c m our class has rate of convergence slower 

ifi ^ ' ( fs I 13b I gp 'I 

than 71 4^i+d because "02, fcopt(2) converges at rate n VWr^ ^+2^/ determined 

by the second order estimation bias and, for m > 3, ipm.koptim) converges at a rate 
no faster than n^™) = n-*^3/((3-i)+44) ^ min^^.^^gj „-4f m/((m-i)+4f ) ^ 

obtained n-4^"/((™-i'+'^^) as {k-^f^/'^f^^ , where k solves the equation 
^m/^m+i ^ fc-4;3/d ^j^^^ cquatcs thc variance fc™/n'"+i of IF„ to the squared 
truncation bias k^'^^^'^. 

First, for the remainder of the paper, we redefine = Zk{X) = (^j, j{X){E[Hi\ 

X]b{X)p{x)}~^/'^ by redefining (p^^2 fi^) to have orthonormal components un- 
der L2(^x)such that the linear spans of {ip^ f{X), . . . , (p^^ fi^)} and {(pt{X), . . . , 
(Pn2{X)} agree for t < v? . This can be accomplished by Gram-Schmidt orthogonal- 
ization of (p^iiX) in Li(Fx) beginning with ip^iiX^ and working backwards. 
To describe our more efficient estimator, define for nonnegative integers 
(0) , fc (1) , k* (0) , k* (1) with k (0) < k (1) and k* (0) < k* (1) the [/-statistic 

^3 (^fc(o),fc, (0)j "^l,^3(^Mo),fc*(o) 

with 



^3 \^fe(o),fe*(o) y 
— <^n'^fc(0),n 



fc(l) fc*(l) 

E E 

Si=fc(0) + 1 S2 = fe*(0)-|-1 



BPH^ 



- J{fe(l)-A;(0)}x{fc'(l)-fc*(0)} I -^fc'(0).i3^j3 



Zsi (X.J (X.J - / [Sl = .S2] 



X 2^2 (^ia) A. 



where zlf^^ 



^fc(0) + l 



, ^fc(l) 
As an example, $\mathbb{IF}_{33,\tilde\psi_k} = \mathbb{U}_3\binom{k,\,k}{0,\,0}$. We can identify $\binom{k(1),\,k^*(1)}{k(0),\,k^*(0)}$ with the rectangle
in $R^2$ defined by $\{(r_1, r_2) ;\ k(0)+1 \le r_1 \le k(1),\ k^*(0)+1 \le r_2 \le k^*(1)\}$ with
$(k(0)+1, k^*(0)+1)$ and $(k(1)+1, k^*(1)+1)$, respectively, the vertices closest
and furthest from the origin. Thus $\mathbb{IF}_{33,\tilde\psi_k} = \mathbb{U}_3\binom{k,\,k}{0,\,0}$ is identified with the rectangle
$\binom{k,\,k}{0,\,0}$. Indeed we can write



E 



BPH, 



Zs, (XjJ 2:^2 {X,:,) - / [si S2] 



(^i.^2)e(.(o):.*(o)j [ X zs, (X,3) A,3 
where, here and below, si and §2 are restricted to be integers, so (si,S2) G 
(fe(cij' fc*(o)) lattice points of the rectangle. 



We next study the variance of ILJ3 



fe(i).fe*(i) 

fc(0), fc*(0) 



It follows from Theorem 3.20 above 



that the number of lattice points in f^|oj' fe»(oj j proportional to the variance of 



U 



fe(l), /=*(!) 



3 \k{0), k'{0) 

and var 



so if fc (0) < fc (1) and k* (0) < k* (1) then var 



^3 Ifc(O), fc*(0) 



^3l0, 



are both of order $k(1)k^*(1)/n^3$. Hence the order of the
variance of $\mathbb{U}_3\binom{k(1),\,k^*(1)}{k(0),\,k^*(0)}$ is determined by the vertex of the rectangle $\binom{k(1),\,k^*(1)}{k(0),\,k^*(0)}$
furthest from the origin.

In contrast by a theorem in the Appendix of our technical report, the mean 



E (n [Sb\zl^]] SgQ'U [Splz'^.ll]] ) (1 + o, (1)) 

with Sb = PE {Hi \X) {b -B^., 5p = BE {Hi \X) (p-p), 5g = 

s 

3 \k{0). fc*(0)^ 

k (O)"'^" k* (0)''^'' (n/ logn) WTT 



Q2 = BPE{Hi\X). It follows that if k{0) < fc(l) and fc* (0) < k* {!) then 



E 



k{i),k'(i) 

3 U(0), fc*(0) 



and 
O 



are both of order 

-fig 



This "bias" is a "product mixture" of truncation bias through the term 

— 3 —Q "^g 

k (0) k* (0) and estimaton bias through the term (n/ logn) ^i^g+i ^ ggg ^j^jg 



for £; 



U3 g;;j;^:gj)],we'supout 



SgQ' 



from 



E 



n 



SgQ-'U 6p\Z 



I— fe*(i) 



which is 



(n/ logn) 2^s+i 



n 



n 



We then apply Cauchy Schwartz to 11 Sb\. 



n 



noting that 



E ( |n (5fe|zj!|Qj I j = O ^fc (0) Again a more careful argument using 
Hölder's inequality would show the log factor is unnecessary. Hence the order of
the mean of $\mathbb{U}_3\binom{k(1),\,k^*(1)}{k(0),\,k^*(0)}$ is determined by the vertex of the rectangle $\binom{k(1),\,k^*(1)}{k(0),\,k^*(0)}$
closest to the origin.



Motivation. With this background we are ready to motivate our new estimator. 

2 

Recall from Section 3.2.5, that with g known, the choice fcfpj (2) = n i+^i^/d gives 

and 



4|3 



opt 

because the truncation bias 



variance are of order n and the estimation bias is zero. Any choice of k larger 

than k^p^ (2) will result in a slower rate of convergence. 

However, when g is unknown and thus estimated, "02. fc" (2 ) ~ "0 does not attain 
the optimal rate of convergence because the estimation bias n 1^3^+^ d+w^ d+w^J 



exceeds 



n i^+d. The estimator ip3,k^^^^{2 ) = V'2,fc»^,(2 ) 

4/3 

fails to attain the rate n 4^+^ because it has variance of the order of 



^opt (2 ) k^pt (2 ) 



= 



Jll + 4l3/d 8|3 



which exceeds O 



(^n WTd^. On 



the other hand, ip 



3-K,A^ ) 



4|j 

has bias of 0„ n '^1'+'' 



because the truncation bias is Op ( n ^^+^ | and the estimation bias 



On I n, KWPFd-^d+Wl^d+W^, 



is also Op (n ] under our assumption (4.4). Our strategy will be to try to 

in the estimator ^3,fcSpt(2 ) = ^2,k^^^^{2 ) + 



replace the term U3 ^0°''*' '''''opti'^ )^ ; 



(si,S2)sr2 



BPhA , Z,, {X,,)Zs2 - / [.51 = .S2] 



where is a subset of the rectangle ^0°"'^^ '''^opJ2 ) 

8/3 

n 4,3+d but the additional bias 
'fe„%*(2 )Mpt (2 ) 



3 







= E 



E 



U3( (o°"^'^'^°^0^^ )\^^ 



E 



^il^si {X.i^) 



(si,s2)e(o/ 



such that var 



r (&3 in)) 



BPHi 



^Sl (-'^j2) 



XZs2 (^»3)^i3 



(fc(0),fc* (0)), satisfying Op 



k (O)"'^" k* {0)"^" 
is Op ( n~ '^TH ^ . This approach will succeed if we can choose 57 and thus

to be sums of rectangles (whose number does not increase with n) such that 
(i) each rectangle in ( q"*"' ' "^'^^ ' \ \f2 has its closest vertex to the origin, say 

< ji iii+d g^Yid (ii) simulta- 
neously each rectangle in 17 has its furthest vertex from the origin, say (fc (1) , k* (1)), 
satisfying O {k (1) k* (1) /n^) = O (^n^^FR^ 

We index the vertices of our set of rectangles as follows. Consider a natural
number $J$ and a set of non-negative integers $\mathcal{K}_{J,tot} = \{k_{-2}, k_{-1}, k_0, \ldots, k_{2J}, k_{2J+1},
k_{2J+2}\}$ satisfying $0 = k_{-2} < k_0 < k_2 < \cdots < k_{2J-2} < k_{2J} < k_{2J+2} = k_{2J+1} <
k_{2J-1} < \cdots < k_1 < k_{-1}$.

Note the elements with even subscripts increase from 0 to $2J+2$ while elements
with odd subscripts decrease from $-1$ to $2J-1$. Further the smallest element with
odd subscript equals the largest element with even subscript. We will use two such
sets $\mathcal{K}_{b,J,tot}$ and $\mathcal{K}_{p,J,tot}$ with corresponding elements $k_{b,l}$ and $k_{p,l}$ with $k_{b,-1} = k_{p,-1}$.
Set for $s \in \{-1, 0, \ldots, J\}$

(4.5) $k_{b,2s+1} = n^{\frac{3d+4\beta}{d+4\beta}} / k_{p,2s+2}$,

(4.6) $k_{p,2s+1} = n^{\frac{3d+4\beta}{d+4\beta}} / k_{b,2s+2}$, so

$k_{p,2s+1}k_{b,2s+2} = k_{b,2s+1}k_{p,2s+2} = n^{\frac{3d+4\beta}{d+4\beta}}$.

We leave $J$, $\mathcal{K}_{p,J} = \{k_{p,2s}, s = 0, \ldots, J+1\}$, and $\mathcal{K}_{b,J} = \{k_{b,2s}, s = 0, \ldots, J+1\}$
unspecified for now but derive optimal values below.
Let $\Omega(\mathcal{K}_{p,J}, \mathcal{K}_{b,J})$ be the union of rectangles



n/ir V \ J I I Ap.2s-l.fe!j,2a \ I I f kp,2s,k), 2b-1 \ [ i i Ap.S J + 1 ^6,2./+ 1 \ 

S2(/CpJ,/Chj) - I (J .fc..2«-2 jUU,2,-2 fc.,2. jjU|,fep.2J fe..2J J' 

The points (fcp,2s+i, ^6,25+2) , (fcp,2s+2, fch,2s-i-i) for s = -1,0, J -t- 1 lie on a 

(fi: i"2) \T1r2 = n > shown in Figure 

1 for J = 2. The set 17 (/Cpj, /Cbj) C (0°"*^^ ^' ^) lies below i/y. 
Define 

^Z,{Kpj,K,j) = V'2,fc_i + U3 {ICpJ,ICbj)) . 

We then have 

Theorem 4.4. (i) T/ie estimator ^^^^ y^^^^j /las variance of the order of 
[Figure 1: the hyperbola Hy in the (K_p, K_b) plane, showing the region Omega(K_p, K_b) and its complement; horizontal axis K_p with ticks k_p,0, k_p,5, k_p,1, k_p,-1; vertical axis with ticks k_b,-1, k_b,3, k_b,5, k_b,4, k_b,2, k_b,0.]

Fig 1. Hyperbola Hy and Associated Rectangles.



and bias E (^V'3,(Kp,,X,.,) ) - ^ of order 

J 



Op < n ^'^9+'' 



„-Pt/d ,-l3i/d,-l3p/d\ 



,-(/3p+/36)/d^ 



n to the variance of 



Proof. Each of the 2 J + 1 rectangles whose union is fl {ICpj,ICbj) has {kp^2s+i, 
kb,2s+2) or (fcp^2s+2, ^b,2s+i) for somc s € {—1,0,..., J} as the vertex furthest 
from the origin and thus contributes p,^'+'^^b,2s+2 

V'3,(/Cp,7X,j)- The variance of ■02,/c_i ^ Now 

= {e {iPs,,_,) - yj} +I^E[v3{n{{ICpj,ICb.j))} 
= Op (^kZ[^p+(^>'^/'^'j + Op U"(wT3+d^+dTw) 



+ E 



(Sr"Y ]\n{{iCpj,iCbj)) 



As is evident from Figure 1, W{ICpj, JCbj) = (g ^' ''o^ ) \^ ((^p,/, ^6j)) is the union
of rectangles uf^^ {(^---' J--) U {^^'^ \--+^)} which have

{(fcp,2s, fcfc,2s+i) , ikjj,2s+i,h,2s) ; s e {-1,0, . . . , J}} 



as the set of vertices closest to the origin, leading to the expression for the bias 
given in the theorem. □ 



Theorem 4.5. Given {/3b,Pp,Pg) with (3p > (3i, so A > 0, Equation (4-1) holds if 
and only if there exists J, JCpj, JChj such that tjj^ ^^^^ — — Op (n'^ ^ . 

If Equation (4.I) holds, E U3 | (^07'' ''0' ) \^ ((^p.^' ''^w))} = Op (n^^FR^ 

and thus ''Ps^^K.pj.Kbi) ^ ~ Op {vT *f*+£i ^ ^ w/ien we choose J to be the smallest 
integer such that 

W ^ ^* ( R R W Y^-'+l n _L A^'^l \ 

2(l+4/3/d) 



(1 + A) ( J + 1) + c* (/3„ /3, A) EtY (1 + A)'"' > ^7TT§75T ^^^^^ 



*r/? . M^/^ '^Pald VA + 2) 2(A + 2) 3 +4/3/rf 
c IP9,P,^J \2(3Jd+l ) Ap/d 4/3/rf+l ^ (1 + 4/3/d)' 

fcfe.o = fcp.o = kf, 2s ~ kp 2s = n^^^'^^''n'^'^i=i^^~^^'^ for s = 1, . . . , J+ 1, with 

1 = {^Twk - (1 + ^) + 1)} /E;iY (1 + A)'-^ 

Note J does not depend on the sample size n. 



Proof. From Theorem 4.4, for the variance of $\hat\psi_{3,(\mathcal{K}_{p,J},\mathcal{K}_{b,J})}$ to be $O_p(n^{-\frac{8\beta}{4\beta+d}})$, $J$ cannot increase with $n$. Further, for the second order truncation bias $O_p(k_{-1}^{-(\beta_p+\beta_b)/d})$ and the square root of the variance of $\hat\psi_{2,k_{-1}}$ both to be $O_p(n^{-\frac{4\beta}{4\beta+d}})$, we must have $k_{-1} = k_{opt}(2) = n^{\frac{2}{1+4\beta/d}}$. It then follows from Equations (4.5) and (4.6) that $k_{p,0} = k_{b,0} = n$.



In order for E 
s = 0,...,J, 



V3{[o:''''o')\^(('^pJ^'^bj))}] = Op (71- JFT^), we require for 



(4.7) {k;l^'/'kJX{'} < n-W^ 

(4.8) {k;'2'/'k-l^l{'] < n-W^ 



Substituting for $k_{b,2s+1}$ in Equation (4.8) using Equation (4.5) and recalling that
$\beta_p > \beta_b$ so $\Delta > 0$, we obtain



8/3 



<^ 1 < -iliriZl < nV2/3s/d + l ;2^J7d„ 4,3/d+l 2,3(,/d / ^ (l+4fj/d) 

p.Zs \ 



_2fig^ „(d+4/J) , 

(4.9) n ^^^fc^as Si ^ < n~^^ 

[ Kp,2s+2 J 

ORUrl ^'^"''^ 8/3/d 2/9 /rf / 3 + 4/3/d X 2/3b/d 

^ S 2/+2 < n^^^^ n-'^^T^kp^' j 

I J£gjd_ \ 1 8/i/d 1 / 3+43/d \ 

fcp,2s 

(4.10) 4^ 1 < %2f±i < „c*(/3,,/3,A)^A^^ 

^ 1 < „c*(/3,,/3,A)„A 

^0<c* (/33,/?,A) + A 

since n ^ ko < kp^2s < fcp,2s+2- 

Solving the last expression for ''^^ obtain 

l-4;3/d 



^^■''^ 2/3,/d+l- ^ l(A + 2)/^'' + '^l + 4/3/d' 

which is Equation (4.1), except with a nonstrict inequality. We have just deduced that the constraint (4.11) was due to restriction (4.8). We have not yet considered whether the restriction (4.7) implies additional constraints. We now show that it does not. Specifically, if we set $k_{p,2l} = k_{b,2l}$ for all $l \in \{1, 2, \ldots, J+1\}$, then Equation (4.7) is true whenever Equation (4.8) holds because of our assumption that $\Delta > 0$. Thus we can set $\mathcal{K}_{p,J} = \mathcal{K}_{b,J}$.

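Thus we have shown that if $\hat\psi_{3,(\mathcal{K}_{p,J},\mathcal{K}_{b,J})} - \psi = O_p(n^{-\frac{4\beta}{4\beta+d}})$, then $k_{-1} = n^{\frac{2}{1+4\beta/d}}$, (4.11) holds, and $J$ must not increase with $n$.

We next show that when the inequality is strict in (4.11) and Equation (4.4) holds, we can find $\mathcal{K}_J = \mathcal{K}_{p,J} = \mathcal{K}_{b,J}$ for which $\hat\psi_{3,\mathcal{K}_J} - \psi = O_p(n^{-\frac{4\beta}{4\beta+d}})$. We then complete the proof of the theorem by showing that when (4.11) holds with an equality, there is no choice of $\mathcal{K}_J$ for which $\hat\psi_{3,\mathcal{K}_J}$ converges at a rate better than $O_p((\log n)\, n^{-\frac{4\beta}{4\beta+d}})$.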

Suppose the inequality is strict in (4.11). Since fco = n. Equation (4.10) ap- 
plied recursively suggests we define fcas = n^-'^+'^^^n'^ i0g^l^^^)^i^-^i'^+'^) j-qj. g _ 

3d+4/3 

L — 

/ 3d + 4/3 I 1 

^27+1 = fej+2 = n'- (''+'''3) / 2 as required when /Cpj = ICbj. Instead we use the mod- 
ified algorithm given in the statement of the theorem which insures that fej+i = 

3+4ff/d 

^2.1+2 = n 2(i+4/3/d) ^ as required. Since J is not a function of n, in order to show 

— 4/3 

■03 converges at rate n '^^+^ , we only need to check the bias. 

Now ^ = „(l+A)^</(l+A) = -i ^ fc(l+A)^,(l+A)-i < ^(l+A)^c*(/3„/3,A)(l+A)»-i 

since q < c* {(3g, [3, A) so the bias of ■03 is Op I n~ vj+d j ^ as required. 



1,..., J + 1 and take fc2s-i-i = "^^ . However, this will not generally give
Suppose now the equality holds in Equation (4.11) so $c^*(\beta_g, \beta, \Delta) + \Delta = 0$ and continue to assume Eq. (4.4) holds. We now construct an estimator $\hat\psi_{3,\mathcal{K}_J}$ that converges at rate $O_p(n^{-\frac{4\beta}{4\beta+d}} \ln(n))$ and show that no estimator in our class $\hat\psi_{3,\mathcal{K}_J}$ converges at a faster rate. We conjecture this rate is minimax when the equality

3d + 4,a 

in Equation (4.11) holds. Again fc2s+i = "kt^H^ ^^'^ ^^'^ previous arguments, 

2 f 3d+4/3 ~| 1/2 

ko = n,k-i — n /d) ^ fc2j+i = fc2j+2 = <n('^+*'3' ^ . We can suppose that 

k2s = n{v {n)Y . It remains to determine v (n) and J — J (n). We know J (n) must 
satisfy 

{ 3d + 4p 'I 1/2 , -. 

n<''+«^)| =n{vin)y^''>+' so 



z; (ti) = nV2(<i+4/j)^ 



8/3 



The variance of ■03 ^ is of order n '^i^ J (n) . Thus the order of the bias will still 
equal that of the variance provided we multiply the RHS of Eq. (4.9) by J (n). Then 

Equation (4.10) becomes 1 < < n'''^'^^''^'^^k^2sJ . Since, = 

V (n) and n = ko < fcp,2s, "we substitute n'^ = k^ for fc^2s i'^ the modified Equa- 

1 ( 3ei + 4/i ,\ 1 1 

tion (4.10) which gives w (n) = J(n)2/3/i . Hence nV2(d+4^) j j(„)2^/<i 
which implies that. 

(4.12) l^ = 0(ln[J(n)]). 

J (n) 

To minimize the variance, we want the slowest growing function of n that satisfies 
Equation (4.12), which is J (n) = ln(n), as claimed. □ 



4.1.2. Case 2: The estimation bias of the third order estimator exceeds the
optimal rate

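In this section we no longer assume that the estimation bias $n^{-\left(\frac{2\beta_g}{2\beta_g+d} + \frac{\beta_b}{d+2\beta_b} + \frac{\beta_p}{d+2\beta_p}\right)}$ of a third order estimator is less than $n^{-\frac{4\beta}{4\beta+d}}$. Then even when Equation (4.11) holds with a strict inequality, $\hat\psi_3$ does not achieve a $n^{-\frac{4\beta}{4\beta+d}}$ rate of convergence because the fourth order bias $n^{-\left(\frac{2\beta_g}{2\beta_g+d} + \frac{\beta_b}{d+2\beta_b} + \frac{\beta_p}{d+2\beta_p}\right)}$ exceeds $n^{-\frac{4\beta}{4\beta+d}}$. However, we will now construct an estimator $\hat\psi_{\mathcal{K}_J} = \hat\psi_{\mathcal{K}_J}(\beta_g, \beta_b, \beta_p)$ that under our assumptions (Ai)-(Aiv) does converge at rate $n^{-\frac{4\beta}{4\beta+d}}$ whenever $(\beta_g, \beta_b, \beta_p)$ given in assumption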

(Aiv) satisfy Equation (4.11) with a strict inequality. Because the estimator is very 
complicated, we have chosen to only define the estimator and give its properties 
in the text. The motivating ideas for and the formal proofs of these properties are 
provided in the appendix of our technical report. 

To define the estimator, we need some additional notation. Define 

a„((oSS!,i</<m-i) 

/ m-l \ 

— V„i I eii^k{l,0),ii 11 \^-°-f^-"l^fe(M-l,0)^fc(«,0) — -^k^-ixk^ J ^fc(m-l,0)^ir,i I 
where $k_u = k(u, 1) - k(u, 0)$ and $I_{k_{u-1} \times k_u} = (I_{ij})_{k_{u-1} \times k_u}$ with $I_{ij} = I(i = j)$.

Then define U„ as ({1)1^^^, ,l<l<m-l). U^^' f ) is de- 



\k{0) J ^™ \y'-^k{a) > :i ' ^ -^y • ^ni \^fe*(0)'A;(0) 

fined as {iltk(ii] , 1 < / < m - with fc 1) = fc (1) , k (/, 0) = fc (0) for / ^ 



and k{u,l) = fc(u,0) = k*{0). Next 0^;^^"+') (fc^(o)'fc"(o) 'fc(o) ) '^^^'^^^ 

as U„ (^(Ofe(,'o) , 1 < ' < ™ - l) with A: (Z, 1) = fc (1) , A: (l, 0) = fc (0) for / 7^ u and 
/ 7^ It + 1, fc (u, 1) = fc* (1) , fc {u, 0) = fc* (0), fc (it +1,1) = fc** (1), k (u +1,0) = 
fc** (0). We wiU use this notation for m = 3, even though v'^^'^^ (fc*(o)'fc"(oj 'fc(o) ) 

does not depend on fc(0),fc(l) and is equal to U3 (^fc*(o):fc.*(o) ^ of the previous 
subsection. 

Finally, define



ti-i 

u=l 

fTfi — \^rr("!"+i) A2.7+1 ^2.7+1 fco A 

u=l 



Theorem 4.6. Given {f3g, /3f,, /3p) satisfying Equation (4-11 ) with a strict inequality, 
define 

(4.3) A.^,) . „, |(^ - ^ - ^) (2 + -i) + 1} + 1 



i^£Ii j " 12 d+2i3^ d+2iij, <^n^'dTW, where 
/3 = ^^^Y^. Let ICj, J, -03 6e as in Theorem 4-5 and define 



^f'3x.,+ E (-i)^'"'tt:+^ ^ (-1)^-1 (G(.,.) 

v—4: s—1 i;— 4 

ni(/3,,/3b,/3p) 
•u=4 

= V„.i (HiBP + iJai? + i73^ + ^^4) - 



v=3 s—1 f;— 3 3
Then



I 



2/3/d /logn\ j,-/36/d, -A 



^2s 

2/33 

" ti+2/3g ,-2l3/d 



2g3 



'p/d ( logn\ 
hi ' \^ n y 

(^) 



(™-l)£g 



d + 2/3p 



l<s<J 



V 



2/3/d ^ log K ^ 3+2^ 



2|3g 



^2s 



log n 



<i+20g , -2/3/d 



Vd / log n ^ 



( "'-l)/3g 

d+2f3g 



-d+W7 ,-Pi/d,-l3p/d 
'^2s+l '^2s ' 



^ Or, in 'i+lP 



and 



fc-1 k2sk2s-l 



'^2,7+1 



Inference. Elsewhere, we prove that $\hat\psi_{\mathcal{K}_J}(\beta_g, \beta_b, \beta_p)$ is asymptotically normal. Here, to avoid the problem of unknown 'constants' for confidence interval construction that we discussed in Section 3.2.5, we will construct nearly optimal rather than optimal confidence intervals. We suppose that Equation (4.11) holds with strict inequality for the $(\beta_g, \beta_b, \beta_p)$ associated with the parameter space $\Theta$. Then there exists $\epsilon > 0$ such that for all $0 < \sigma < \epsilon$, $(\beta_g, \beta_b - \sigma, \beta_p - \sigma)$ satisfies Equation (4.11) with strict inequality,



supers 



Eg 








varg 









and 



Let 



supers |vare (/3g, /?h - cr, /3p - a) |? | 



Op(l) 



fj d + 4(/3-<T) ^ 



^lijiPg^Pb 1 Pp ) be a uniformly consistent estimator of (the properly 



standardized) varg -iplzfif^giPb ,Pp )\t 
Section 3.2.5. Then, for all tr < e, 



constructed in the same manner as in 



'W 



V^^/(/35,/36-a,/3p~(7) 



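converges uniformly in $\theta \in \Theta$ to a $N(0, 1)$. Moreover,

$\hat\psi_{\mathcal{K}_J}(\beta_g, \beta_b - \sigma, \beta_p - \sigma) \pm z_{\alpha/2}\, \hat{W}_{\mathcal{K}_J}(\beta_g, \beta_b - \sigma, \beta_p - \sigma)$

is a conservative uniform asymptotic $(1 - \alpha)$ confidence interval for $\psi(\theta)$ with diameter of the order of $n^{-\frac{4(\beta-\sigma)}{d+4(\beta-\sigma)}}$.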





Remark 4.7. If Equation (4.11) holds with an equality and $\mathcal{K}_J$, $J$, $\hat\psi_{3,\mathcal{K}_J}$ are as in the final paragraph of the preceding subsection, then the proof of Theorem 4.6 in the appendix of our technical report implies $\hat\psi_{\mathcal{K}_J}(\beta_g, \beta_b, \beta_p) - \psi(\theta) = O_p((\log n)\, n^{-\frac{4\beta}{4\beta+d}})$.

5. Adaptive confidence intervals for regression and treatment effect 
functions with unknown marginal of X 

In this section we describe how to construct adaptive confidence intervals (i) for a regression function $b(X) = E[Y|X]$ when the marginal of $X$ is unknown and (ii) for the treatment effect function and optimal treatment regime in a randomized clinical trial.



5.1. Regression functions



Example 1a (Continued). Consider the case $b = p$, $O = (Y, X)$ with $b(X) = E(Y|X)$. As usual, we assume for all $\theta \in \Theta$, $b(\cdot)$ and the density $g(\cdot)$ of $X$ are contained in known Hölder balls $H(\beta_b, C_b)$ and $H(\beta_g, C_g)$. Redefine $\psi(\theta) = E_\theta[\{b(X) - \hat b(X)\}^2]$, where $\hat b(\cdot)$ is an adaptive estimate of $b(\cdot)$ from the training sample and expectations and probabilities remain conditional on the training sample. Adaptivity of $\hat b(\cdot)$ implies that if $b(\cdot) \in \Theta$ is also contained in a smaller Hölder ball $H(\beta^*, C)$, $\beta^* > \beta_b$, $C < C_b$, then $\hat b(\cdot)$ will converge to $b(\cdot)$ under $F(\cdot, \theta)$ at rate $O_p(n^{-\frac{\beta^*/d}{1+2\beta^*/d}})$. Robins and van der Vaart showed that, when the marginal density $g(x)$ of $X$ is known, the key to constructing optimal (rate) adaptive confidence balls for $b(X)$ was to find a rate optimal estimator of $E_\theta[\{b(X) - \hat b(X)\}^2]$.

We shall show that their approach fails when the marginal of $X$ is unknown, but that a modification described below succeeds. Specifically, if $b(\cdot) \in \Theta$ lies in a smaller Hölder ball $H(\beta^*, C)$, $\beta^* > \beta_b$, $C < C_b$, our modification results in honest asymptotic confidence balls under $F(\cdot, \theta)$, $\theta \in \Theta$, whose diameter is (essentially) of the same order $O_p(\max\{n^{-\frac{\beta^*/d}{1+2\beta^*/d}}, n^{-\frac{2\beta_b/d}{1+4\beta_b/d}}\})$ as the diameter of Robins and van der Vaart's optimal adaptive region or ball, provided either (i) $\beta_b/d > 1/4$ and $\beta_g/d > 0$ or (ii) $\beta_b/d < 1/4$ and Equation (4.1) holds with $\beta = \beta_b$. This order is the maximum of the minimax rate $n^{-\frac{\beta^*/d}{1+2\beta^*/d}}$ of convergence of $\hat b(X)$ to $b(X)$ were $b(X)$ known to lie in $H(\beta^*, C)$ and the square root of the minimax rate of convergence of an estimator of $E_\theta[\{b(X) - \hat b(X)\}^2]$ in the larger model $\mathcal{M}(\Theta)$ with $b(\cdot)$ and $g(\cdot)$ only known to lie in $H(\beta_b, C_b)$ and $H(\beta_g, C_g)$.

The case where $\beta_b/d < 1/4$ and Equation (4.1) does not hold will be considered elsewhere.



Now, since $E_\theta[\hat b(X) b(X)] = E_\theta[\hat b(X) Y]$, $\psi(\theta) = E_\theta[\{b(X) - \hat b(X)\}^2] = E_\theta[b(X)^2] - 2E_\theta[\hat b(X) b(X)] + E_\theta[\hat b(X)^2]$ has first order influence function $IF_{1,\psi}(\theta) = \mathbb{V}[H(b, b) - \psi(\theta)]$ where

$H(b, b) = b^2(X) + 2b(X)[Y - b(X)] - 2\hat b(X) Y + \hat b^2(X)$,

so $H_1 = -1$, $H_2 = H_3 = Y$, $H_4 = -2\hat b(X) Y + \hat b^2(X)$. Thus $H(b, b)$ for $E_\theta[b(X)^2]$ differs from $H(b, b)$ for $E_\theta[\{b(X) - \hat b(X)\}^2]$ only in $H_4$. Since the truncation



bias ipk [d) ~ "ip (d), higher order influence functions of ipk (0) and estimation bias 
do not depend on H4, it foUows that TBk(e) ,W (9), ~ , and EB^iO) 



are identical for ?/> (6) = Eg 



b{X)-b{X) 



and ip (9) = Eg b (X) 



. In con- 



trast, IFi^^ 19) is identically zero for ^(9) = Eg 



b{X)-b{X) 



^{9) = Eg b{XY . Thus, by Theorem 3.21, for ij} {9) = Eg 



but not for 
2" 



vare 



-(-) 



li k > n and m > 1, and var^i 



b{X)-b{X) 



Q \i k < n and 



TO = 1. In the case when k < n and 7ti > 1, by the Hoeffding decomposition, 



vare 



vare ^ B,s 




where M 
we have 



vara 



is a sth order degenerate U-statistic. Further by Theorem 3.21, 



max varo 



as vare 



i (k. 

n \n 



, vare 



for any s > 2. Moreover, 



since the kernel of ] 



vare 




b{X)-b{X) 



n 



is of order Op b {X) — b (X) ^ . In summary 



vare 



m,ipk 



b{X)-b{X) 



( _JiM^-i k 

= max n 1+23^ ^ 



if fc < n and to > 1. (In contrast, for t/i {9) = Eg b{X) 



varg 



i if 



$k < n$). Thus, if $\beta_b/d > 1/4$, (i) the estimator has $k_{opt}(m_{opt})$ of $O(n^{\frac{2}{1+4\beta_b/d}})$, where $n^{\frac{2}{1+4\beta_b/d}} < n$ comes from equating the order $k^{-4\beta_b/d}$ of $TB_k^2(\theta)$ to the order $k/n^2 = n^{-\frac{8\beta_b/d}{1+4\beta_b/d}} \ll n^{-1}$ of the variance, and (ii) $m_{opt}$ is the smallest integer such that the order $n^{-\left(\frac{(m-1)\beta_g}{2\beta_g+d} + \frac{2\beta_b}{d+2\beta_b}\right)}$ of $EB_m$ is less than the order $n^{-\frac{4\beta_b/d}{1+4\beta_b/d}}$ of the standard error. It follows that, for $\beta_b/d > 1/4$, in contrast to $\psi(\theta) = E_\theta[b(X)^2]$, we can estimate $\psi(\theta) = E_\theta[\{b(X) - \hat b(X)\}^2]$ at (the minimax) rate $n^{-\frac{4\beta_b/d}{1+4\beta_b/d}}$, which is faster (i.e., smaller) than the usual parametric rate of $n^{-1/2}$.
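The arithmetic behind this balance can be checked with a short sketch. It assumes, as the text states, that the squared truncation bias is of order $k^{-4\beta_b/d}$ and the variance of order $k/n^2$; the function names and the sample values of $\beta_b$ and $d$ are hypothetical.

```python
def k_opt_exponent(beta_b, d):
    """Exponent a with k_opt = n**a, from equating k**(-4*beta_b/d) with k / n**2."""
    return 2.0 / (1.0 + 4.0 * beta_b / d)


def se_exponent(beta_b, d):
    """Exponent r with standard error of order n**r at k = k_opt."""
    a = k_opt_exponent(beta_b, d)
    return (a - 2.0) / 2.0  # square root of the variance order k / n**2


beta_b, d = 2.0, 4.0  # hypothetical smoothness and dimension with beta_b / d = 1/2 > 1/4
a = k_opt_exponent(beta_b, d)
r = se_exponent(beta_b, d)

print("k_opt exponent:", a)                       # below 1, so k_opt < n
print("squared-bias exponent:", -4 * beta_b / d * a)
print("variance exponent:", a - 2.0)              # matches the squared-bias exponent
print("standard-error exponent:", r)              # below -1/2 when beta_b / d > 1/4
```

With $\beta_b/d = 1/2$ the two exponents agree and the standard error exponent equals $-(4\beta_b/d)/(1+4\beta_b/d)$, which is smaller than $-1/2$.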

When $\beta_b/d < 1/4$, the minimax rates for $\psi(\theta) = E_\theta[\{b(X) - \hat b(X)\}^2]$ and $\psi(\theta) = E_\theta[b(X)^2]$ are identical and, when Eq. (4.1) holds, it follows from Theorem 4.6 that $\hat\psi_{\mathcal{K}_J}(\beta_g, \beta_b, \beta_b)$ achieves the minimax rate of $n^{-\frac{4\beta_b/d}{1+4\beta_b/d}} > n^{-1/2}$.

Henceforth assume either (i) $\beta_b/d > 1/4$ or (ii) $\beta_b/d < 1/4$ and Equation (4.1) holds. Pick an $\epsilon$ so that Equation (4.1) holds for $(\beta_g, \beta_b - \epsilon, \beta_p - \epsilon)$. Let $0 < \sigma < \epsilon$

and define = ^(a) = ^^™„,„{fc„,,(,„„,,)}^+" ^^^^ 
W* = W* (cr) = W 



if (3b /d > 1/4 and i'* = -0^^/ {Pg,(3b - cr, (3p - a) and 



W* = W 



^k! (/3g, /3b - cr, /3p - cr) if (3b/d < 1/4. Note W* is Op ( n 



uniformly over $\Theta$, where $\Theta$ is the parameter space with smoothness parameters $(\beta_g, \beta_b)$. Then, by Equation (4.1) and results in Section 4.1.2, as $n \to \infty$,



inf Pr 



> 1 - a. 



{r-ti'{9)} 

Thus, if $\psi(\theta)$ were a function of $\theta$ only through $b(\cdot)$ so $\psi(\theta) = \psi(b)$, the set

(5.1) $\{b^*(\cdot);\ \psi(b^*) < \hat\psi^* + z_\alpha W^*\}$

would be a uniform asymptotic $(1 - \alpha)$ confidence region for $b(\cdot)$. However, for $\psi(\theta) = E_\theta[\{b(X) - \hat b(X)\}^2]$, this approach fails because $\psi(\theta)$ also depends on $\theta$ through the unknown density $g(x)$ of $X$. This approach succeeded in Robins and van der Vaart [lit] because $g(x)$ was assumed known.

We consider two solutions. The first gives (near) optimal adaptive honest inter- 
vals. The second would give honest, but non-optimal, intervals. The first solution 



is to replace ip (9) with its empirical mean ipemp (b) = 
Equation (5.1), 

,2" 



{bix)-bix)y 



ipcTup (b) -i{}{9) = Op 



[b{x)-b{x)y
uniformly in G 8. It is straightforward to check that for all /3f, > 0, n v rf+^^t ^
n . Thus, for cr < e, | V'* - V'emp (^) [ / 1 V'* - i/' (6*) [ = 1 + Op (1) uniformly 



over 6* G 8, so inf^eepre 



(5.2) 



> -2. 
2 



> 1 — a and 



is a uniform asymptotic $(1 - \alpha)$ confidence region for $b(\cdot)$. Moreover, if $b(\cdot) \in \Theta$ lies in a smaller Hölder ball $H(\beta^*, C)$, $\beta^* > \beta_b$, $C < C_b$, then, under $F(\cdot, \theta)$, the
diameter

_ .1/2 r / ^K--) \ 1 



Op max-; n i+2fi*/<i,n 



1/2 



since i/" (0) = Op ri i+2/3V<i and ip* - ?/' (6*) and W* are Op n 



Eg 



The second, non-optimal, solution would be to replace the functional "0 (9) = 

(b{X)~b{X)y with i'{b) = J |6(a;) - &(.T)|^da;. The functional V (6) is 

the first functional we have considered that is not in our doubly robust class of 
functionals. Arguing as above, if we can construct an asymptotically normal higher 
order [/-statistic estimator V'* that converges to ip (b) at rate on A4 (8) and a 
consistent estimator W* of its standard error, then 



&*(•); f {b{x)-bix)y dx<ii* +zj 



would be an honest adaptive confidence interval of diameter 



Op l^max <! n ^+-v'/d ^ n""/^ 



We conjecture, based on arguments given else- 



where, that the minimax rate for estimation of 



ceeds 0„ 



Vb 



^(6) ^ j{bix)~bix)y 



dx ex- 



whenever 



2/3g/d+l ^ {l+4P/d){l+20/d) 



Since 



(l+4/3/d)(l+2/3/d) 



> 



i+tp^/d ^/'^ all /3 > 0, it follows that, when the marginal of X is unknown and 

I3g/d ^ l-4/3/d 



/3/d 



> 



jP/d, intervals based on V 

2 



[b* {x)-b{x)y 



{l+40/d){l+2P/d) ^ 2Pg/d+l 1+iP/d 

will, but intervals based on $\int |b^*(x) - \hat b(x)|^2 dx$ will not, have diameter of the same
order as the optimal interval with the marginal of $X$ known.



5.2. Treatment effect functions in a randomized trial 

Example 4 (Continued). Consider the case $b = p$, $Y = Y^*$ w.p.1 so we have data $O = (Y, A, X)$, where $A$ is a binary treatment, $Y$ is the response, and $X$ is a vector of prerandomization covariates. The randomization probabilities $\pi_0(X) = P(A = 1|X)$ are known by design and $b(x) = E_\theta(Y|A = 1, X = x) - E_\theta(Y|A = 0, X = x)$ is the average treatment effects function. For $\theta \in \Theta$, $b(\cdot)$ and the density $g(\cdot)$ of $X$ are contained in known Hölder balls $H(\beta_b, C_b)$ and $H(\beta_g, C_g)$. Suppose we have an adaptive estimator $\hat b(\cdot)$ of $b(\cdot)$ based on the



training sample constructed as described below. Now, since Eg 



Eg 



b{X)Y\A=\ 



- Eg 



b{X)Y\A = 



has influence function 



b{X)b{X) 



l-A 



b{X)], where erg (x) 



Yb (X) - Eg b (X) b {X) = (A - TTO (X)) a^^ (X) Yb (X) - Eg 



Eg 



b{X)-b{X) 



first order influence functions, indexed by arbitrary functions c(a;), IFi ,0 i 
IFi,^ (6*) =Y[H (6, 6) - V {0)] with 

i/i = 1 - 2A {A - ^0 {X)} do-' {X) , 
H2 = H, = {A - TTO (X)} a^^ (X) Y, 

Hi^{A- TTO (X)} c (X) - 2 (A - TTO (X)) ao^ (X) Yb{X) +P (X) . 



b{X) X 

has 

c) ^ 



Thus H (6, b) for Eg 



b{X)-b(X) 



differs from H (6, b) for {0) = Eg b {X) 



only in $H_4$. It follows that all the properties of the confidence ball (5.2) for $b(\cdot) = E_\theta(Y|X = \cdot)$ in the setting of the last subsection remain true for $b(\cdot) = E_\theta(Y|A = 1, X = \cdot) - E_\theta(Y|A = 0, X = \cdot)$ in the setting of this subsection.

Now define $d_{b^*}(x) = I[b^*(x) > 0]$. It then follows that an honest $1 - \alpha$ uniform asymptotic confidence set for the optimal treatment regime $d_{opt}(\cdot) =$


/ [b (•) > 0] is given by <^ 4* (•) ; V 



{b* iX)~b{X)y 



< iIj* + Zo,^ 



Adaptive estimator of the treatment effect function. One among many approaches to constructing a rate-adaptive estimator of $b(\cdot)$ is as follows. Split the training sample into two random subsamples: a candidate estimator subsample of size $n_c$ and a validation subsample of size $n_v$, where both $n_c/n$ and $n_v/n$ are bounded away from 0 as $n \to \infty$. Noting that

$0 = E_\theta[\{Y - A b(X)\}\, q(X)\, \{A - \pi_0(X)\}]$

for all $q(\cdot)$, we construct candidate estimators of $b(\cdot)$ as follows. For $s = 1, 2, \ldots, n - 1$, let $\hat\kappa_s$ be the solution, if any, to the $s$ equations

$0 = P_c[\{Y - A \kappa_s^T \bar\varphi_s(X)\}\, \bar\varphi_s(X)\, \{A - \pi_0(X)\}]$,

where $\bar\varphi_s(X) = (\varphi_1(X), \ldots, \varphi_s(X))^T$, $\varphi_1(X), \varphi_2(X), \ldots$ is a complete basis with respect to Lebesgue measure in $R^d$ that provides optimal rate approximation for Hölder balls, and $P_c$ is the empirical measure for the candidate estimator subsample. Our candidates for $b(X)$ are the $\hat b^{(s)}(X) = \bar\varphi_s(X)^T \hat\kappa_s$. Robins [16] proved that $b(\cdot)$ is the unique function $b^*(\cdot)$ minimizing $Risk(b^*) = E_\theta[\sigma_0^{-2}(X)\{Y - [A - \pi_0(X)] b^*(X)\}^2]$. In fact, the candidate $\hat b^{(s)}(X)$ in our set for which $Risk(\hat b^{(s)})$ is smallest is also the candidate that minimizes $E[\{b(X) - \hat b^{(s)}(X)\}^2]$, since $Risk(\hat b^{(s)}) - Risk(b) = E[\{b(X) - \hat b^{(s)}(X)\}^2]$.



Specifically, 



E 



7„-2(X){y-[A-7ro(X)]&(-^) {x)Y 
-a^' {X){Y-[A-no{X)]h{X)Y 



= E 



a^^ (X) {A TTo (X)) [b{X) {X)) X 

2 {Ab (X) - E {Y\A = 0, X)) -{A-TTa (X)) (b (X) + fet^' (X) 



E (X) {A TTO {X)) A [b (X) - 6'^) (X)) ' 

(X))' 



We use these results to select among our candidates by cross-validation. Let $\hat b(\cdot)$ be the $\hat b^{(s)}(\cdot)$ minimizing $P_v[\sigma_0^{-2}(X)\{Y - [A - \pi_0(X)] \hat b^{(s)}(X)\}^2]$ over $s = 1, 2, \ldots, n - 1$, where $P_v$ is the validation subsample empirical measure. If $b(\cdot)$ were known to lie in a Hölder ball $H(\beta, C)$, it is easy to check that the candidate $\hat b^{(s)}(\cdot)$ with $s = \lfloor n^{\frac{d}{d+2\beta}} \rfloor$ obtains the optimal rate of $n^{-\frac{2\beta}{2\beta+d}}$ for estimating $E[\{b(X) - \hat b^{(s)}(X)\}^2]$. Since the number of candidates at sample size $n$ is less than $n$, it then follows at once from van der Laan and Dudoit's [23] results on model selection by cross validation that $\hat b(\cdot)$ is adaptive over Hölder balls.
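The candidate-then-validate procedure can be sketched on simulated data as follows. This is only an illustration under stated assumptions: the monomial basis stands in for a basis with optimal approximation properties, $\pi_0(X) = 1/2$ is constant, $\sigma_0^2(X)$ is taken to be $\pi_0(X)\{1 - \pi_0(X)\}$, only six candidates are fitted rather than $n - 1$, and the data-generating model and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def basis(x, s):
    """First s monomials 1, x, ..., x**(s-1); a stand-in for phi_1, ..., phi_s."""
    return np.vander(x, N=s, increasing=True)


def fit_candidate(y, a, x, pi0, s):
    """Solve 0 = P_c[{Y - A * phi_s(X)'kappa} phi_s(X) {A - pi0(X)}] for kappa."""
    phi = basis(x, s)
    w = a - pi0
    lhs = phi.T @ (phi * (w * a)[:, None])
    rhs = phi.T @ (w * y)
    return np.linalg.lstsq(lhs, rhs, rcond=None)[0]


def validation_risk(y, a, x, pi0, kappa):
    """P_v[sigma0^{-2}(X) {Y - (A - pi0(X)) b(X)}^2], with sigma0^2 = pi0 (1 - pi0)."""
    b = basis(x, len(kappa)) @ kappa
    return np.mean((y - (a - pi0) * b) ** 2 / (pi0 * (1.0 - pi0)))


def simulate(n):
    """Hypothetical trial: true treatment effect function b(x) = 1 + x."""
    x = rng.uniform(0.0, 1.0, n)
    pi0 = np.full(n, 0.5)
    a = rng.binomial(1, pi0).astype(float)
    y = x + a * (1.0 + x) + 0.1 * rng.standard_normal(n)
    return y, a, x, pi0


yc, ac, xc, pc = simulate(2000)   # candidate estimator subsample
yv, av, xv, pv = simulate(2000)   # validation subsample

fits = {s: fit_candidate(yc, ac, xc, pc, s) for s in range(1, 7)}
risks = {s: validation_risk(yv, av, xv, pv, k) for s, k in fits.items()}
s_hat = min(risks, key=risks.get)  # cross-validated choice of s

grid = np.linspace(0.0, 1.0, 201)
mse = np.mean((basis(grid, s_hat) @ fits[s_hat] - (1.0 + grid)) ** 2)
print("selected s:", s_hat, "approximate integrated squared error:", mse)
```

The weighting by $\sigma_0^{-2}(X)$ is what makes differences in validation risk track differences in $E[\{b(X) - \hat b^{(s)}(X)\}^2]$, which is why the minimizer of the empirical risk is a sensible selector.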



6. Testing, confidence sets, and implicitly defined functionals 

In Example 1c of Section 3.1, we considered the following problem. We were given a functional $\psi(\tau, \theta)$ indexed by a real number $\tau$ and the parameter $\theta \in \Theta$. The implicitly defined functional $\tau(\theta)$ was the assumed unique solution to $0 = \psi(\tau, \theta)$. We noted that a $(1 - \alpha)$ confidence set for $\tau(\theta)$ is the set of $\tau$ such that a $(1 - \alpha)$ confidence interval for $\psi(\tau, \theta)$ contains 0. In the following subsection we derive the width of the confidence set for $\tau(\theta)$. We then generalize the problem in the second subsection by introducing the notions of the testing tangent space, a testing influence function, and the higher order efficient testing score. In the final subsection, we show how the two earlier subsections are related.
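The inversion described above can be sketched as a grid scan. The functional $\psi(\tau, \theta) = E_\theta[\tanh(Y - \tau)]$, the sample-mean estimator, its plug-in standard error, and the function name `confidence_set` are all hypothetical choices made for illustration; they are not the higher order estimators of this paper.

```python
import math

import numpy as np


def confidence_set(y, taus, z=1.959964):
    """Grid points tau whose Wald interval for psi(tau, theta) covers 0."""
    keep = []
    for tau in taus:
        g = np.tanh(y - tau)                       # hypothetical psi(tau, theta) = E[tanh(Y - tau)]
        psi_hat = g.mean()                         # estimator of psi(tau, theta)
        w_hat = g.std(ddof=1) / math.sqrt(len(y))  # scale estimator
        if abs(psi_hat) < z * w_hat:
            keep.append(tau)
    return np.array(keep)


rng = np.random.default_rng(1)
y = rng.normal(loc=1.0, scale=1.0, size=2000)      # by symmetry, tau(theta) = 1
taus = np.linspace(0.0, 2.0, 2001)

cs = confidence_set(y, taus)
lo, hi = cs.min(), cs.max()
print("confidence set for tau(theta): approximately [%.3f, %.3f]" % (lo, hi))
```

Because $\hat\psi(\tau)$ is monotone in $\tau$ in this toy example, the retained grid points form an interval whose length shrinks at the same rate as the scale estimator.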



6.1. Confidence intervals for implicitly defined functionals

To derive the order of the length of the confidence interval for the parameter $\tau(\theta)$ in Example 1c, we can use the next theorem as follows. Assume Equation (4.1) holds and $\beta < 1/4$. Then we can take the estimator $\hat\psi(\tau)$ and rate $n^{-\gamma}$ in the theorem to be the estimator $\hat\psi_{\mathcal{K}_J}$ and rate $n^{-\frac{4(\beta-\sigma)}{d+4(\beta-\sigma)}}$ for a very small positive $\sigma$ and conclude that the length of the confidence interval for $\tau(\theta)$ in Example 1c is $O_p(n^{-\frac{4(\beta-\sigma)}{d+4(\beta-\sigma)}})$.

Theorem 6.1. Suppose for an estimator $\hat\psi(\tau)$ and functional $\psi(\tau, \theta)$, there is a scale estimator $\hat W(\tau)$ such that $n^{\gamma} \hat W(\tau) \to w(\tau, \theta)$ in $\theta$-probability, $w(\tau, \theta) > c^* > 0$ and $(\hat\psi(\tau) - \psi(\tau, \theta))/\hat W(\tau)$ converges in law to $N(0, 1)$ uniformly for $\theta \in \Theta$, $\tau \in \{\tau(\theta); \theta \in \Theta\}$. Then, (i) with $z_\alpha$ the $\alpha$-quantile and $\Phi(\cdot)$ the CDF of a $N(0, 1)$, the confidence set $C_n = \{\tau;\ -z_{1-\alpha/2} < \hat\psi(\tau)/\hat W(\tau) < z_{1-\alpha/2}\}$ is a uniform asymptotic $1 - \alpha$ confidence set for the (assumed) unique solution $\tau(\theta)$ to $\psi(\tau, \theta) = 0$; (ii) the probability under $\theta$ that a sequence $\tau = \tau_n$ satisfying $\psi(\tau_n, \theta) = a_n n^{-\rho}$, $a_n \to a \neq 0$, is contained in $C_n$ converges to $1 - \alpha$ when $\rho > \gamma$, is $o(1)$ when $\rho < \gamma$, and converges to $\Phi(z_{1-\alpha/2} - a/w(\tau(\theta), \theta)) - \Phi(-z_{1-\alpha/2} - a/w(\tau(\theta), \theta))$ when $\rho = \gamma$. (iii) If $\psi(\tau, \theta)$ is uniformly twice continuously differentiable in $\tau$ and $0 < \sigma < |\psi_\tau(\tau(\theta), \theta)| < c$ and $|\psi_{\tau^2}(\tau(\theta), \theta)| < c$ for constants $(\sigma, c)$, then (ii) holds for a sequence $\tau = \tau_n$ satisfying $\tau_n - \tau(\theta) = \{\psi_\tau(\tau(\theta), \theta)\}^{-1} a_n n^{-\rho}$, $a_n \to a \neq 0$, $\rho > 0$.

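Proof. (i) That $C_n$ is a uniform asymptotic $1 - \alpha$ confidence set is immediate. (ii) Now

$\frac{\hat\psi(\tau_n)}{\hat W(\tau_n)} = \frac{\hat\psi(\tau_n) - \psi(\tau_n, \theta)}{\hat W(\tau_n)} + \frac{\psi(\tau_n, \theta)}{\hat W(\tau_n)}.$

(iii) Since $\psi(\tau_n, \theta) = \psi_\tau(\tau(\theta), \theta)(\tau_n - \tau(\theta)) + \frac{1}{2}\psi_{\tau^2}(\tau^*(\theta), \theta)(\tau_n - \tau(\theta))^2$ for some $\tau^*(\theta)$ between $\tau(\theta)$ and $\tau_n$, we have that $\psi(\tau_n, \theta) = a_n n^{-\rho} + O(a_n n^{-2\rho}) = a_n (1 + o(1)) n^{-\rho}$ satisfies the assumption in (ii). □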

Remark 6.2. Under some further regularity conditions, the solution r to = -0 (''■) 
is asymptotically normal with mean r (9) and variance (t , 9) {w (r (9) , 9)}^ 
uniformly over 9 e &, t e {t (9) -,9 e 6}. 





6.2. Testing influence functions and a higher order efficient score 

In the following, we repeatedly use definitions from Section 2, which might usefully 
be reviewed at this point. 

Definition 6.3. $m$-th order testing nuisance tangent space, testing tangent space, testing influence functions, efficient score, efficient information, and efficient testing variance: Given a model $\mathcal{M}(\Theta)$ with parameter space $\Theta$ and a functional $\tau(\theta)$, define $\mathcal{M}(\Theta(\tau^\dagger))$ to be the submodel with parameter space $\Theta(\tau^\dagger) = \Theta \cap \{\theta; \tau(\theta) = \tau^\dagger\}$. Thus $\mathcal{M}(\Theta(\tau^\dagger))$ is the submodel with $\tau(\theta)$ equal to $\tau^\dagger$. Define, for
9 G Q (t^), the m**^ order (i) testing nuisance tangent space rj^"««^*es* (^''''0 ^'^ be 
the m*^ order tangent space for the submodel M [Q {t'')) 1 (ii) testing tangent space 
r^''* (6',rt) to be the closed linear span of IFi_^(.) (9) U r;;,"^'*'^'** {9,t'^), (iiia) set 
pnms,tesi,± ,^t) = |lF*^^*(.)| of testing influence functions to be the orthocom-
plement of r;;,"'''*'^'** {9,t^) inW™ (9), (iiib) set T^Jd,nms,test,± (^^^^t) ^ |lF*^''^*^^''*|
of standardized testing influence functions to be



{TTiostd.test -pnuis,test,± 

(iv) efficient testing score ES*^^* (6) 



Eg 



std,test^^eff 
m,r(-) ^^'^{•) ^ ^ 



vare 



IF' 



(^) e r* 



to be 



= He [ESf (61) |r 



•nuis.te.st.l^ 



where ESf {Q) = ES'-^fj.) (6*) = var^ {l^'tli.-) (^)} ^ (^)' (^) efficient test- 

ing information to be varg jES^"* (0)}, and {vi) the efficient testing variance to be 

[var,{E§^^*(0)}]-\ 

Further define, for e 0, the m"' order (i) estimation nuisance tangent space 

vz'' {0) to be rr- {6) = {a,„ e r„ (0) ; £; 



cient estimation variance to be varg 



IF' 



eff 



Remark 6.4. For $m = 1$, the testing and estimation nuisance tangent spaces $\Gamma_m^{nuis,test}(\theta, \tau^\dagger)$ and $\Gamma_m^{nuis}(\theta)$ are identical. However, for $m > 1$, $\Gamma_m^{nuis,test}(\theta, \tau^\dagger)$ is generally a strict subset of $\Gamma_m^{nuis}(\theta)$. For example, if the model can be parametrized as $\theta = (\tau, \rho)$ and $\Theta$ is the product of the parameter spaces for $\tau$ and $\rho$, then $\Gamma_m^{nuis,test}(\theta, \tau^\dagger)$ is the space of $m$-th order scores for $\rho$; however, $\Gamma_m^{nuis}(\theta)$ also includes the mixed scores that have $s$ derivatives in the direction $\tau$ and $m - s \geq 1$ derivatives in $\rho$ directions. It is this strict inclusion that gives rise to higher order phenomena that do not occur in the first order theory.

Theorem 6.5. Suppose E§^"* (9) exists in Um (9). Then for 9 e Q (r^), 

(i) the set of estimation nuisance scores F""** (9) includes the set of testing 
nuisance scores p™'*'*'^** (0, r^) with equality of the sets when m = 1, 



(ii) IF*,j''*(.-) (0) , € (rt) is standardized if and only if E 



if; 



ES^'^''* (9)] = 1 if and only if E 



= 1, 



■^test 
' 7n,r( 



■^test 
' 7n,r( 



(9) ESf (9) 
[lF*;^;'*(.)E§f (61) 

(iv) the set {iFjn .^^.-) (0)} of all m*'' order estimation influence functions is 
contained in |lF^'^^*^j'*| with equality of the sets when m = 1, 



(iii) 



m,r(-) W X 



(v) He 



IF' 



std.test 



■ ^ (0) |rj,r* (e, Tt)J = {w [E§r (0)] V' F§^^* (0) , 

(vi) {varg [ES^*'* (^)] } ^ ES^*'* (9) G |lF^'^^*'^^**| anii /las the minimum vari- 
ance jware [ES*^*' ((?)] }• ^ among members of |lF^*''^*';^^*| . In particular 
{varg [ES^f * (^)] < '"are i^tlii-) (^) ^^'^ equality when m = 1, 

(vii) Given IF^**^.-) (•) S |lF^**(.) (•)|,an2/ smooth submodel 9 {Q with range 
containing 9 and contained in Q (t^) , and s < m, we have 



d'Eg
Thus, if Eg



r(-) 



) is Frechet difjerentiable with respect to 0* to order to + 1 



for anorm\[\\, Eg IF^f*(.) (61 + (56*) = O {\\5e 11"'+^) for 6 and 9 + 56 in an open 

neighborhood contained in Q ij"^^, since the Taylor expansion of Eg IF^'^**^.-) [9*) 
around 9 through order m is identically zero. 

The proof of the Theorem will use the following two lemmas:
Lemma 6.6. For any W^^'l^.^ {9),9ee (r^) 



Eg 



IF^f*(.) (9) ESf (9) 



Eg 



IF*0(.) (0)E§^^* (9) 



Proof. 



Eg 



Eg 



if: 



test 
m,T(-) 



Ug (9) |r: 



:7iuis.test.l. 



Eg 



IF^f*(.)E§*^^* {9) 



where the last equality holds by IF*,';^*(.) € r™'"'*^^*-^ {9, rt) 
Lemma 6.7. For any IF*0(.) (9),9eQ [t^) , 

ng [iF^-:*(.) {9) |r* 

= E [lF*-*(.) [9) E§*r {9)] {var [ES*r' (^)] ES,*r {&) 
= E [lF^;'*(.) {9) ESf (61)1 {var [E§*^"* {9)] ES*;;"* (6i) . 

Proof, r^f * (61, rt) = {cE§*f/* (6*) ; c € © r;;"*^^*'^'^* (6*, rt) . Thus, by 



□ 



{ ()\ r- Tenuis. test. 1. (a _t\ 



^test 
- m,r(-) 



w.) (^) |r* 



IF*-*(.) (0) I {cE§,r (0);cei?i} 



IF^:*(.) (9) E§r (0) {var [E§^^* (9)] y' ES'^^' (9) . 



Now apply Lemma 6.6. 



□ 



Proof of Theorem 6.5. (i) is immediate from the definitions. (ii) and (iii) follow
from



E 



if: 



test 



= l^E 



IF^;*(.) (6i)E§f''* {9) 



= 1 



^Eg 



IF*0(.)IF^^/ {9) 



vara 



iiFi w.) {e) 



where we have used Lemma 6.6. For (iv), note $\{IF_{m,\tau(\cdot)}(\theta)\} \subset \{IF^{test}_{m,\tau(\cdot)}\}$ follows from the fact that every smooth submodel through $\theta$ in model $\mathcal{M}(\Theta(\tau^\dagger))$ is a smooth submodel through $\theta$ in model $\mathcal{M}(\Theta)$. Thus it remains to prove that $IF_{m,\tau(\cdot)}(\theta)$ is standardized. But, by Part 4 of Theorem 2.3,



Eg 



IF,„,,(.)(0)IF^^/(.)((?) 



= vare 



IF'
(v) follows at once from Lemma 6.6 and Part (ii). For (vi), note that

{var, [E§^^* (9)] y' E§*^''* {6) e {lF^';*(7*} 
by definition. Thus 

var, ^Ee [lF*-*(.)E§:r (^)l ^Km-)] > {^^r, [ES^ (9)] 



follows from (v). The result then follows from part (iii). Part (vii) is proved anal- 
ogously to Theorem 2.2 except now all scores lie in T^^^ {9) by range 9 {() in 

e(rt). □ 

In the case of (locally) nonparametric models, we can explicitly characterize 
r'r*^^ (61, rt). Let ^Vff'^ (^'^^)} be the set of ah 



with the U!^f'^ (0,rt) = J2^^ W U^ls (O. J 0) £ (9), indexed by 

constants ci G R^, and functions hi^s {Oi, ; 9) satisfying Eg [hi^s [Oi^ ; 9)] = 0. We re- 
mark that the subset ofUj (6) comprised of all jth order degenerate U-statistics can 
i 

be written 



s=l 



Thus |lLJ*'"J*'^ (0, T^) I simply restricts 



one of the hmctions hi^s {O ; 9) to be ciIF^J^ ^ . 

Theorem 6.8. If the model Ai (O) is (locally) nonparametric, i/ien P^''*'''- (0, r^) = 
{E™ 2^*7'^ (0,rt) ; U57'^ (^?,rt) £ {u*^'^ (0,rt)}} . 

Proof. Since the model is locally nonparametric P^'** (^i'''^) includes the set of all 
mean zero first order f7-statistics Ui (9) and thus any element of P^*^**'^ [9, r^) must 
be a sum of degenerate U -statistics of orders 2 through m. We continue by induction. 
First we prove the theorem for to = 2. Now, P|^"* {9,t'<) = Ui {9) + U^f-'^*^"* (9) 
where jg closed linear span of the second order degenerate part 

J2s=£j Sh,jSi.^^s of second order scores S^j^ = Ej Si^i^j + J2s^j ^h-jSh.s in model 
A4 {^Q(t^)), where "^s^j ^h,j'^h,s is a sum of products Si^jSi^^s of first order 
scores in model Ai (0 (t^)) for two different subjects. By model M (8) being 
(locally) nonparametric, the set of first order scores in model Ai (8 (t^)) is pre- 
cisely the set of random variables p™*'*^*'^** (0, r^) orthogonal to IF^-^JI^ ^ (9). But 
the set of degenerate f/-statistics of order 2 orthogonal to the product of two scores 
in r"™'*'*''^* (^''r^) is clearly |U2';2*'"^ (^'■^^)}- 

Suppose now the theorem is true for $m$, $m \ge 2$; we show it is true for $m+1$. By $\mathcal{M}(\Theta)$ (locally) nonparametric
and the induction assumption, P*^^*i (61, rt) = P*^«* [9,t'') + Z^,*^+*i,„i+i (0) where 
^m+i'm.+i (0) is the closed linear span of the sum of products of first order scores 
in model Ai (8 (t^)) for to + 1 different subjects. But |Um+i^,„+i is the 

set of set of degenerate [/-statistics of order m + 1 orthogonal to Z^m+i'm+i {9). □ 
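
The proof above is phrased in terms of degenerate U-statistics whose kernels are sums of products of mean-zero functions of distinct subjects. As a minimal numerical sketch (not taken from the paper), the following Python code evaluates a second order U-statistic with the product kernel a(o_1) b(o_2), once by brute force over ordered pairs of distinct subjects and once through an algebraically equivalent closed form. The functions a and b, the Uniform(-1, 1) sampling law, and the helper names are assumptions chosen only for illustration.

```python
import itertools

import numpy as np


def u_stat_bruteforce(a, b):
    """Average of a[i] * b[j] over all ordered pairs of distinct subjects i != j."""
    n = len(a)
    total = sum(a[i] * b[j] for i, j in itertools.permutations(range(n), 2))
    return total / (n * (n - 1))


def u_stat_fast(a, b):
    """Same quantity: (sum_i a_i)(sum_j b_j) minus the diagonal terms i == j."""
    n = len(a)
    return (a.sum() * b.sum() - (a * b).sum()) / (n * (n - 1))


rng = np.random.default_rng(1)
o = rng.uniform(-1.0, 1.0, size=50)  # hypothetical i.i.d. observations
a = o                                # E[O] = 0 under Uniform(-1, 1)
b = o ** 2 - 1.0 / 3.0               # E[O^2] = 1/3 under Uniform(-1, 1)

slow = u_stat_bruteforce(a, b)
fast = u_stat_fast(a, b)
```

Because both a and b have mean zero under the assumed sampling law, the product kernel has zero conditional mean given either argument, which is the degeneracy property the proof relies on.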

6.3. Implicitly defined functionals and testing influence functions

In the following theorem we show that estimation influence functions $\mathbb{IF}_{m,\psi(\tau,\cdot)}(\theta)$ for the parameter $\psi(\tau,\cdot)$ evaluated at the solution $\tau(\theta)$ to $0=\psi(\tau,\theta)$ are contained in the set $\{\mathbb{IF}_{m,\tau(\cdot)}^{test}(\theta)\}$ of testing influence functions for $\tau(\theta)$. We also derive the estimation influence functions $\mathbb{IF}_{m,\tau(\cdot)}(\theta)=\sum_{s=1}^{m}\mathbb{IF}_{s,s,\tau(\cdot)}(\theta)$ for $\tau(\theta)$ in terms of the estimation influence functions $\mathbb{IF}_{m,\psi(\tau,\cdot)}(\theta)$ for $\psi(\tau,\cdot)$ and their derivatives with respect to $\tau$.



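Theorem 6.9. Let $\tau(\theta)$ be the assumed unique functional defined by $0=\psi(\tau(\theta),\theta)$, $\theta\in\Theta$. Then, for $\theta\in\Theta(\tau^\dagger)$, whenever $\mathbb{IF}_{m,\psi(\tau^\dagger,\cdot)}(\theta)$ and $\mathbb{IF}_{m,\tau(\cdot)}(\theta)$ exist,

(i) $\mathbb{IF}_{m,\psi(\tau^\dagger,\cdot)}(\theta)\in\{\mathbb{IF}_{m,\tau(\cdot)}^{test}(\theta)\}$,

(ii) $\mathbb{IF}_{1,\tau(\cdot)}(\theta)=-\psi_\tau^{-1}\,\mathbb{IF}_{1,\psi(\tau^\dagger,\cdot)}(\theta)\in\{\mathbb{IF}_{1,\tau(\cdot)}^{std,test}(\theta)\}$ where $\psi_\tau=\partial\psi(\tau,\theta)/\partial\tau|_{\tau=\tau^\dagger}$,

(iii) $\mathbb{IF}_{m,m,\tau(\cdot)}(\theta)=-\psi_\tau^{-1}\{\mathbb{IF}_{m,m,\psi(\tau^\dagger,\cdot)}(\theta)+\mathbb{Q}_{m,m}(\theta)\}$, where $\mathbb{Q}_{m,m}(\theta)\equiv\mathbb{Q}_{m,m,\tau(\cdot)}(\theta)=\mathbb{V}\{Q_{m,m,\tau(\cdot)}(\theta)\}\in\{\mathbb{U}_{m,m}^{test}(\theta,\tau^\dagger)\}$. For $m=2$,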



(6.1) 



dIF, 



dT 



l,-r(-),i2 



(0) 



dlF^ , t s 

where I, 



$\mathbb{Q}_{m,m}(\theta)$ is given in the appendix of our technical report as well as the general formula.
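
Part (ii) can be motivated by a short first order calculation. The sketch below is heuristic and assumes that $\tau(\cdot)$ and $\psi(\cdot,\cdot)$ are differentiable along a smooth one dimensional submodel $\tilde\theta(\zeta)$ of the full model with $\tilde\theta(\zeta_0)=\theta$, $\tau(\theta)=\tau^\dagger$, and first order score $\mathbb{S}_1(\theta)$; the symbols $\zeta_0$ and $\mathbb{S}_1$ are introduced here only for illustration.

```latex
\begin{align*}
0 &= \frac{d}{d\zeta}\,\psi\bigl(\tau(\tilde\theta(\zeta)),\tilde\theta(\zeta)\bigr)\Big|_{\zeta=\zeta_0}
   = \psi_\tau\,\frac{d\,\tau(\tilde\theta(\zeta))}{d\zeta}\Big|_{\zeta=\zeta_0}
   + \frac{d\,\psi(\tau^\dagger,\tilde\theta(\zeta))}{d\zeta}\Big|_{\zeta=\zeta_0},\\[4pt]
\frac{d\,\psi(\tau^\dagger,\tilde\theta(\zeta))}{d\zeta}\Big|_{\zeta=\zeta_0}
  &= E_\theta\bigl[\mathbb{IF}_{1,\psi(\tau^\dagger,\cdot)}(\theta)\,\mathbb{S}_1(\theta)\bigr],\\[4pt]
\frac{d\,\tau(\tilde\theta(\zeta))}{d\zeta}\Big|_{\zeta=\zeta_0}
  &= E_\theta\bigl[\{-\psi_\tau^{-1}\,\mathbb{IF}_{1,\psi(\tau^\dagger,\cdot)}(\theta)\}\,\mathbb{S}_1(\theta)\bigr].
\end{align*}
```

Since the submodel is arbitrary, the last line identifies $-\psi_\tau^{-1}\,\mathbb{IF}_{1,\psi(\tau^\dagger,\cdot)}(\theta)$ as a first order influence function for $\tau(\cdot)$ whenever $\psi_\tau\neq 0$.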

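Proof. (i) For $r\le m$, consider any suitably smooth $r$ dimensional parametric submodel $\tilde\theta(\zeta)$ with range containing $\theta$ and contained in $\Theta(\tau^\dagger)$. Let $\mathbb{S}_{s,\bar l_s}(\theta)$ be any associated $s$th-order score, $s\le m$. By definition of $\tau(\theta)$, $\psi(\tau(\tilde\theta(\zeta)),\tilde\theta(\zeta))=0$. Hence,

$$0=\left.\frac{\partial^{s}\psi\bigl(\tau(\tilde\theta(\zeta)),\tilde\theta(\zeta)\bigr)}{\partial\zeta_{l_1}\cdots\partial\zeta_{l_s}}\right|_{\zeta=\tilde\theta^{-1}(\theta)}.$$

Now we expand the RHS using the chain rule and note that the only non-zero term is the term $\psi_{\backslash\bar l_s}(\tau^\dagger,\theta)$ in which all $s$ derivatives are taken with respect to the second $\tilde\theta(\zeta)$ in $\psi(\tau(\tilde\theta(\zeta)),\tilde\theta(\zeta))$; all other terms include derivatives of $\tau(\tilde\theta(\zeta))$, which are zero by range $\tilde\theta(\zeta)\subset\Theta(\tau^\dagger)$. Further, $\psi_{\backslash\bar l_s}(\tau^\dagger,\theta)=E_\theta[\mathbb{IF}_{m,\psi(\tau^\dagger,\cdot)}(\theta)\,\mathbb{S}_{s,\bar l_s}(\theta)]$ by the definition of the estimation influence function $\mathbb{IF}_{m,\psi(\tau^\dagger,\cdot)}(\theta)$. We conclude that $\mathbb{IF}_{m,\psi(\tau^\dagger,\cdot)}(\theta)$ is in $\{\mathbb{IF}_{m,\tau(\cdot)}^{test}(\theta)\}$.

(ii) $\mathbb{IF}_{1,\tau(\cdot)}(\theta)=-\psi_\tau^{-1}\,\mathbb{IF}_{1,\psi(\tau^\dagger,\cdot)}(\theta)$ is straightforward. That $\mathbb{IF}_{1,\tau(\cdot)}(\theta)$ is contained in $\{\mathbb{IF}_{1,\tau(\cdot)}^{std,test}(\theta)\}$ follows by Part (iv) of Theorem 6.5.

(iii) See Appendix of our technical report for proof. □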



6.4. "Inefficiency" of the efficient score



We now provide an example to show that, contrary to what one might expect based on Part (vi) of Theorem 6.5, inference concerning $\tau(\theta)$ may be more efficient when based on an 'inefficient' member of the set $\{\mathbb{IF}_{m,\tau(\cdot)}^{test}(\theta)\}$ such as $\mathbb{IF}_{m,\psi(\tau^\dagger,\cdot)}(\theta)$ than when based on the efficient score $\mathbb{ES}_{m,\tau(\cdot)}^{test}(\theta)$. Without loss of generality, it is sufficient to consider the case $m=2$. In the following example it is $\tilde\tau_k(\theta)$ and $\tilde\psi_k(\tau^\dagger,\theta)$ that play the role of $\tau(\theta)$ and $\psi(\tau^\dagger,\theta)$ in the preceding theorem, because $\tilde\tau_k(\theta)$ and $\tilde\psi_k(\tau^\dagger,\theta)$ have, but $\tau(\theta)$ and $\psi(\tau^\dagger,\theta)$ do not have, higher order estimation and testing influence functions.

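Example 1c (Continued). In this example, with $Y^*(\tau)=Y^*-\tau A$, $A$ and $Y^*$ binary,

$$\psi(\tau,\theta)=E_\theta\bigl[\{Y^*(\tau)-E_\theta(Y^*(\tau)|X)\}\{A-E_\theta(A|X)\}\bigr]$$

and $\tau(\theta)$ satisfies $\psi(\tau,\theta)=0$. Let $\tilde\tau_k(\theta)$ satisfy $\tilde\psi_k(\tilde\tau_k(\theta),\theta)=0$ where $\tilde\psi_k(\tau,\theta)=E_\theta[Y^*(\tau)A]-E_\theta\{\Pi_\theta[B(\tau)|Z_k]\,\Pi_\theta[P|Z_k]\}$ is defined in Section 3.1 with $\tau$ a real-valued index and $B(\tau)=b(X,\tau)=E_\theta(Y^*(\tau)|X)$. Note $\tilde\psi_{k,\tau}(\tau,\theta)=\partial\tilde\psi_k(\tau,\theta)/\partial\tau=-\{E_\theta[A^2]-E_\theta[\{\Pi_\theta[P|Z_k]\}^2]\}$, $\psi_\tau(\tau,\theta)=-E_\theta[\mathrm{var}_\theta(A|X)]$, $\psi_{\tau^2}(\tau,\theta)=\tilde\psi_{k,\tau^2}(\tau,\theta)=0$. Below we freely use results of Theorems 3.11, 3.14, and 3.17. We suppose that $0<\sigma<\mathrm{var}_\theta(A|X)$ and $E_\theta[A^2]<c$ for some $(\sigma,c)$,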



< 1/4. Choose k = kopt {2)n 



2a 



^2'^,(T > so the truncation 



bias of ip2.k (t) = 'ijj2,k ('''i^) is Op (n^Wd ) and n"^^ ^ varg i/'2,A: (t) 



k/n 



2 ^ „-2(l||^+-) 



4g 



We assume the given {f3g,Pt,f3p) are such that the order 



of the estimation bias of -02, fc (''■) is Op In *f*+£i 



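Then $\hat\psi_{2,k}(\tau)-\tilde\psi_k(\tau,\theta)$ and $\hat\psi_{2,k}(\tau)-\psi(\tau,\theta)$ are $O_p\bigl(n^{-\frac{4\beta}{4\beta+d}+\sigma}\bigr)$, which just exceeds the minimax rate $O_p\bigl(n^{-\frac{4\beta}{4\beta+d}}\bigr)$ for $\sigma$ very small.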
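
Because $Y^*(\tau)=Y^*-\tau A$ is linear in $\tau$, the functional in this example satisfies $\psi(\tau,\theta)=E_\theta[\mathrm{cov}_\theta(Y^*,A|X)]-\tau E_\theta[\mathrm{var}_\theta(A|X)]$, so its root is a ratio of two expectations. The following Python sketch checks this on simulated data. It is an illustration only: the data-generating law, the use of the true nuisance functions b and p (an oracle that is not available in practice), and all function names are assumptions made here, and the code does not implement the higher order estimators discussed in the text.

```python
import numpy as np


def simulate(n, tau0, rng):
    """Hypothetical law with binary A and Y and E[Y | A, X] = 0.2 + 0.3 X + tau0 * A."""
    x = rng.uniform(0.0, 1.0, size=n)
    p = 0.3 + 0.4 * x                      # p(X) = E[A | X], stays inside (0, 1)
    a = rng.binomial(1, p)
    y = rng.binomial(1, 0.2 + 0.3 * x + tau0 * a)
    b = 0.2 + 0.3 * x + tau0 * p           # b(X) = E[Y | X]
    return y, a, b, p


def psi_n(tau, y, a, b, p):
    """Sample analogue of psi(tau, theta) using the true b and p (oracle assumption)."""
    resid_y = (y - tau * a) - (b - tau * p)  # Y*(tau) - E[Y*(tau) | X]
    return np.mean(resid_y * (a - p))


def solve_tau(y, a, b, p):
    """Root of psi_n in tau; closed form because psi_n is linear in tau."""
    return np.mean((y - b) * (a - p)) / np.mean((a - p) ** 2)


rng = np.random.default_rng(0)
tau0 = 0.25
y, a, b, p = simulate(200_000, tau0, rng)
tau_hat = solve_tau(y, a, b, p)
```

Under this assumed law the conditional covariance of Y and A given X equals tau0 times the conditional variance of A, so the root of $\psi(\tau,\theta)$ is tau0, and `tau_hat` should be close to it for large samples.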

Our goal is to compare the coverage and length of confidence intervals for Tk (9) 
and T (9) based on 



C. 



i'2.k T, 9 



l-Q,i/>k(r) 



T; -Zi-a/2 < 



2,Vfc(i-) 



T2.k 



a,2, ES 



< 



2,Tfe 
2,Tfc 



< 21-q/2 



< ^1-q/2 / J 



'(r) 



< ^1-q/2 



where 



2.^fc(T) 



0. 



2,Tt 



, ^6* (T)j are appropriate variance estimators. 



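$\hat\theta$ is our usual split sample initial estimator, and $\hat\theta(\tau^\dagger)$ is an initial split sample estimator depending on $\tau^\dagger$ that satisfies $\tilde\psi_k[\tau,\hat\theta(\tau^\dagger)]=0$ if $\tau=\tau^\dagger$, i.e., $\tilde\tau_k[\hat\theta(\tau^\dagger)]=\tau^\dagger$. We assume that if $\tilde\tau_k(\theta)=\tau^\dagger$ then the convergence rate under $\theta$ of our estimator of $b(X,\tau^*)$ for any $\tau^*$ remains $n^{-\beta_b/(d+2\beta_b)}$.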

We shall see that the interval based on $C_{1-\alpha,\tilde\psi_k(\tau)}$ outperforms the other two interval estimators. The next theorem gives explicit formulae for $\hat\psi_{2,k}(\tau,\hat\theta)$, $\mathbb{ES}_{2,\tilde\tau_k(\cdot)}^{test}(\hat\theta(\tau))$, and $\hat\tau_{2,k}(\hat\theta)$. Using these formulae we calculate the biases and variances necessary to compare the coverage of the three intervals. Before proceeding, note the assumption $0<\sigma<E_\theta[\mathrm{var}_\theta(A|X)]$, $E_\theta[A^2]<c$ implies

$$|\tilde\tau_k(\theta)-\tau(\theta)|\,/\,|\tilde\psi_k(\tau,\theta)-\psi(\tau,\theta)|$$

is uniformly bounded away from zero and infinity. It then follows from earlier results on $\hat\psi_{2,k}(\tau)$, the assumption $0<\sigma<E_\theta[\mathrm{var}_\theta(A|X)]$, $E_\theta[A^2]<c$, and Theorem 6.9 that $C_{1-\alpha,\tilde\psi_k(\tau)}$ is a uniform asymptotic $1-\alpha$ confidence interval for both $\tau(\theta)$ and $\tilde\tau_k(\theta)$ of length $O_p\bigl(n^{-\frac{4\beta}{4\beta+d}+\sigma}\bigr)$.

Our comparison requires each of our three candidate procedures to be on the same scale. Therefore we used standardized versions of the relevant statistics.



Theorem 6.10. Suppose the assumptions described in the preceding example hold. 
Then 



(i) 



Y* {r)-biX,r) UA-piX)} 



- V 



{ [y* (r) - b {X, r)] Zl}^^ {Z,[A- p{X)] }^ 



where r) = B (r) ^ E^{Y* (r) \X), p{X) = P = E^{A\X); 
ill) Lett denote Y ~ b (X), and A denote A ^ p{X). Thus, 



V 



X 

where 



E^ 



Eh 



I'—^i 

-1 



El; 



x£o 



+E^ 



Zk.jAj 
Zk,j^j 



and 
Also, 

(6.2) 



$v(\theta)=E_\theta[\mathrm{var}_\theta(A|X)]$

(iii)



T2 



where 



with Y = Y* T 



where 



Q 



2,2,Tfc(-),i2 

1 ^1 

= V 

2 



{A p{x)}l V (e)] [{y-bix)} {A p{x)} 

{A piX)}l V (e)] [[Y-b{X)} {A piX)} 



Proof. The proof of (i) was given earlier. The proofs of (ii) and (iii) are in the 
Appendix of our technical report. □ 



Theorem 6.11. Suppose $\tilde\tau_k(\theta)=\tau^\dagger$ and the assumptions of the preceding theorem
hold. Then





^(rt),rt) 


. 2,2,7fc(-) \ 





varg 



V [0] lj}2.k T, 



varel^var-^^.^{ESlf^^^ 
1 + Op (1) 



1 -1 



(ii) 



varg 



^2,2,rfc(-) 



varg 



V {e) ' V2,fc (r, ^) /varg jra.fc (^) - r^} = 1 + Op (1) 



(iii) 

Wp) \g [V'2,fc (rl 't 
(iv)



0,{{P-P) (B(rt)-B(rt; 



2,Tfc 



9{X) 
9{X) 



-i]+(p-P) + (b (rt) - B (rt 



Or, 



max < n 



fib 



d + 2fip I ^ \ £i + 2/3t £i + 2/3p ) ^ \ £i + 2/3t "1" ci+2/Jp J 



(v) 




P -P 



am 



I] B-B 



P - P] { P - P] { B - B 

2 



P -P 



3(A-) 



1 



B - S 



,_2p_ , 



( J±S_\ f>P fib. 

'\d+2/3p ) 



fib fip 



^ 2l3g + d ) ^ d+213^ _|_ ti + 23p _|_ 2/3g+ti 



Proof. The proof of part (iii) was given earlier. The remaining parts are proved in 
the Appendix of our technical report. □ 

We conclude from this theorem that the savings in variance that comes with using $\mathbb{ES}_{2,\tilde\tau_k(\cdot)}^{test}(\hat\theta(\tau^\dagger))$ rather than $\hat\psi_{2,k}(\tau,\hat\theta)$ is asymptotically negligible even in regard to constants. Similarly, we conclude that the difference in variance that comes with using $\hat\psi_{2,k}(\tau,\hat\theta)$ rather than $\mathbb{IF}_{2,\tilde\tau_k(\cdot)}(\hat\theta)$ is asymptotically negligible,
again even in regard to constants. Further, because varg 

and vare 



Hj*,iesi,_L / 


?(rt),rt)" 


. 2,2,7fc(-) \ 





are of the order of o (-) as their first order degenerate 



kernels are both of order Op (1), and n^f+d |V''2.fc (^t,1 
asymptotically normal, we conclude that 



4/3 

n^fi+ 



-'-"{T2.k (0)-Eg[T2,k (0)]} 



'2~(.)'^^-" 



4^ 



2.rfc(.) 

are all asymptotically nor- 



and n^'^v (IPj {v^a.fc (r, ^) - Eg 
mal with the same asymptotic variance. 

It then follows that a necessary condition for the intervals based on $\hat\psi_{2,k}(\tau,\hat\theta)$, $\mathbb{ES}_{2,\tilde\tau_k(\cdot)}^{test}(\hat\theta(\tau))$, and $\hat\tau_{2,k}(\hat\theta)-\tau$ to cover $\tilde\tau_k(\theta)=\tau^\dagger$ at the nominal $1-\alpha$ level as $n\to\infty$ is that



(e) ^ Eg 



vaK 



and Eg 



T2,k 



(«1 



are $O_p\bigl(n^{-\frac{4\beta}{4\beta+d}+\sigma}\bigr)$.



Now we know under the assumptions of Theorem 6.11 that this necessary condi- 



tion holds iov V [6] E, 



7 ' 



one and, by assumption, n \'^f>a+'i'^ ^+20^-^ d+20 
necessary condition need not hold for either 



(j)^ is bounded away from zero and 
= Op (^11^-^'^+''^ . However, this 



(.t)H~:,) {Hr'))yEg 



or Eg 



T2,k 



For example, consider the following specification consistent with our assumptions: $\beta_p/d=0$, $\beta_b/d=\beta_g/d=1/4$. Then $\beta/d=1/8$, so



/3g 



/36 



n 



^ 2fi„+d''' d+2fi. + ti+2/3. 



4/3 



However, Eg 



T2.k 



converges 



to zero at rate n ^+2^ = n e . Next 

-1 



var 



■.,{ 



test 
'2,rfc(-) 



\r^))y Eg 



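for small $\sigma$. We conclude that the intervals based on $\mathbb{ES}_{2,\tilde\tau_k(\cdot)}^{test}(\hat\theta(\tau))$ and $\hat\tau_{2,k}-\tau$ fail to cover $\tilde\tau_k(\theta)=\tau^\dagger$ at the nominal $1-\alpha$ level uniformly over $\Theta$ as $n\to\infty$. We reach the identical conclusion with regard to the parameter $\tau(\theta)$ because under our assumptions $|\tau(\theta)-\tilde\tau_k(\theta)|=O_p\bigl(n^{-\frac{4\beta}{4\beta+d}}\bigr)$.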

Furthermore, by the argument used in the proof of Theorem 6.9, it is easy to see that the length of each interval is $O_p\{(k/n^2)^{1/2}\}=O_p\bigl(n^{-\frac{4\beta}{4\beta+d}+\sigma}\bigr)$. It follows that if we try to improve the coverage of the intervals based on $\mathbb{ES}_{2,\tilde\tau_k(\cdot)}^{test}(\hat\theta(\tau))$ and $\hat\tau_{2,k}$ by widening them, the length of the intervals will increase beyond

$O_p\bigl(n^{-\frac{4\beta}{4\beta+d}+\sigma}\bigr)$. We conclude that the interval based on $\hat\psi_{2,k}(\tau,\hat\theta)$ is strictly preferred to the other two intervals when $\beta_p/d=0$, $\beta_b/d=\beta_g/d=1/4$ and is never worse in terms of shrinkage rate and coverage than the other two intervals whatever be $\beta_p$, $\beta_b$, and $\beta_g$. We reach the identical conclusion with regard to the coverage of the parameter $\tau(\theta)$ because, under our assumptions including our choice of $k$, $|\tau(\theta)-\tilde\tau_k(\theta)|=O_p\bigl(n^{-\frac{4\beta}{4\beta+d}}\bigr)$ and $n^{-\frac{4\beta}{4\beta+d}}<n^{-\frac{4\beta}{4\beta+d}+\sigma}$, the order of the interval lengths.

These results translate directly into analogous results concerning the associated estimators. Under our assumptions the estimator solving $\hat\psi_{2,k}(\tau,\hat\theta)=0$ converges to both $\tau(\theta)$ and $\tilde\tau_k(\theta)$ at rate $O_p\bigl(n^{-\frac{4\beta}{4\beta+d}+\sigma}\bigr)$. In contrast, $\hat\tau_{2,k}(\hat\theta)$ and the estimator solving $\mathbb{ES}_{2,\tilde\tau_k(\cdot)}^{test}(\hat\theta(\tau))=0$ converge to $\tau(\theta)$ and $\tilde\tau_k(\theta)$ at the rates given in (iv) and (v) of Theorem 6.11.

What is the intuition behind the above findings? First note that, as promised 
by Theorem 2.2 and part (vii) of the theorem in the last subsection, the bias away 



from zero of var- 



Tk{9)],andv[9) Eg 



,Eg 



T2.k 



arc all 0„ 



However the nature 



and convergence rate of the Op 
timators, attaining a minimum for Eg 



term can vary markedly between es- 



Now it is not surprising that, 



for the same order of variance, the order of Eg 



T2.k 



- ru (0) 



often exceeds 



that of Eg 



Confidence intervals for $\tilde\tau_k(\theta)$ based on $\hat\tau_{2,k}(\hat\theta)$ are centered at (i.e., are symmetric around) $\hat\tau_{2,k}(\hat\theta)$, which is a quite stringent constraint on the form of the interval. In that sense, intervals based on $\hat\tau_{2,k}(\hat\theta)$ are a higher order generalization of the first order asymptotic Wald intervals for $\tilde\tau_k(\theta)$. It is well known that when $\tilde\tau_k(\theta)$ is an implicit parameter that sets a functional such as $\tilde\psi_k(\tau,\theta)$ to zero, first-order Wald confidence intervals are often outperformed in finite samples by confidence sets obtained by inverting a 'score-like' test based on first order 'estimating functions' for the functional that depend on the parameter $\tilde\tau_k$ and, frequently, on estimated nuisance parameters as well, although this fact is not reflected in the first order asymptotics. Our example is a higher order version of this phenomenon, where the benefit of the interval $C_{1-\alpha,\tilde\psi_k(\tau)}$ obtained by inverting tests based on the estimating function $\hat\psi_{2,k}(\tau,\hat\theta)$ for the functional $\tilde\psi_k(\tau,\theta)$ is clearly and quantitatively revealed by the asymptotics. Note that, like first order Wald intervals, the interval based on $\hat\tau_{2,k}(\hat\theta)$ will differ from the interval for $\tilde\tau_k(\theta)$ based on applying an inverse nonlinear monotone transform $h^{-1}(\cdot)$ to the end points of a Wald interval for the transformed parameter $h\{\tilde\tau_k(\theta)\}$ that is centered on $\widehat{h(\tau)}_{2,k}(\hat\theta)=h(\tilde\tau_k(\hat\theta))+\mathbb{IF}_{2,h(\tilde\tau_k(\cdot))}(\hat\theta)$. In contrast, like first order score-based intervals, the intervals based on $\hat\psi_{2,k}(\tau,\hat\theta)$ and $\mathbb{ES}_{2,\tilde\tau_k(\cdot)}^{test}(\hat\theta(\tau))$ are invariant to monotone transformations of the parameter $\tilde\tau_k(\theta)$.
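
The first order phenomenon invoked above can be seen in a textbook setting. The following Python sketch is only an analogy under stated assumptions: it uses a binomial proportion rather than the functionals of this paper, the logit as the monotone transform, and hypothetical function names. It compares a Wald interval computed on the original scale with one computed on the logit scale and mapped back, and it inverts the same score test on both scales.

```python
import math

Z = 1.959963984540054  # 0.975 quantile of the standard normal distribution


def expit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def wald_interval(successes, n):
    """Wald interval on the probability scale, symmetric around p_hat."""
    p_hat = successes / n
    half = Z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half, p_hat + half


def wald_interval_via_logit(successes, n):
    """Wald interval for eta = logit(p) (delta-method s.e.), mapped back to p."""
    p_hat = successes / n
    eta_hat = math.log(p_hat / (1.0 - p_hat))
    se_eta = 1.0 / math.sqrt(n * p_hat * (1.0 - p_hat))
    return expit(eta_hat - Z * se_eta), expit(eta_hat + Z * se_eta)


def score_interval(successes, n):
    """Closed-form inversion of the score test (Wilson interval)."""
    p_hat = successes / n
    denom = 1.0 + Z ** 2 / n
    centre = (p_hat + Z ** 2 / (2.0 * n)) / denom
    half = Z * math.sqrt(p_hat * (1.0 - p_hat) / n + Z ** 2 / (4.0 * n * n)) / denom
    return centre - half, centre + half


def score_interval_via_logit(successes, n, grid=200_001):
    """Invert the same score test over a grid of eta = logit(p) values."""
    p_hat = successes / n
    accepted = []
    for i in range(grid):
        eta = -8.0 + 16.0 * i / (grid - 1)
        p0 = expit(eta)
        if abs(p_hat - p0) <= Z * math.sqrt(p0 * (1.0 - p0) / n):
            accepted.append(p0)
    return min(accepted), max(accepted)


wald = wald_interval(7, 40)
wald_logit = wald_interval_via_logit(7, 40)
score = score_interval(7, 40)
score_logit = score_interval_via_logit(7, 40)
```

In this toy case the two Wald intervals differ noticeably, while the score-based set is the same on either scale up to the resolution of the grid, which mirrors the invariance property described in the text.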

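More interesting and perhaps more surprising is that, for the same order of variance, the order of $E_\theta\bigl[\mathrm{var}^{-1}_{\hat\theta(\tau^\dagger)}\{\mathbb{ES}_{2,\tilde\tau_k(\cdot)}^{test}(\hat\theta(\tau^\dagger))\}\,\mathbb{ES}_{2,\tilde\tau_k(\cdot)}^{test}(\hat\theta(\tau^\dagger))\bigr]$ exceeds that of $E_\theta[\hat\psi_{2,k}(\tau^\dagger,\hat\theta)]$. The surprise derives from a failure to recognize that Theorem 6.5 is simply too general to help select among competing procedures. For example, this theorem implies that under law $\theta\in\Theta(\tau^\dagger)$, (a) the variance of $\mathrm{var}^{-1}_{\hat\theta(\tau^\dagger)}\{\mathbb{ES}_{2,\tilde\tau_k(\cdot)}^{test}(\hat\theta(\tau^\dagger))\}\,\mathbb{ES}_{2,\tilde\tau_k(\cdot)}^{test}(\hat\theta(\tau^\dagger))$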
is less (and generally strictly less) than the variance of



while (b) both have bias of $O_p\bigl(\|\hat\theta(\tau^\dagger)-\theta\|^{3}\bigr)$. At first blush, this might suggest that the estimator solving $\mathbb{ES}_{2,\tilde\tau_k(\cdot)}^{test}(\hat\theta(\tau))=0$ would likely have the same bias but smaller variance than the estimator solving $\hat\psi_{2,k}(\tau,\hat\theta)=0$. But we have seen that just the opposite is true. The reason is that the difference between the variances in (a) is negligible in the sense that their ratio is $1+o_p(1)$, while the $O_p$ biases are often of quite different orders, with that of $v(\theta)^{-1}E_\theta[\hat\psi_{2,k}(\tau^\dagger,\hat\theta(\tau^\dagger))]$ always a minimum. Furthermore, the theory of higher order estimation and testing influence functions, as a theory of score functions, is, in itself, insufficient to order these biases. Rather, side calculations were required. See Remark 4.3 above for further discussion.

More generally, whenever the functional $\psi(\tau,\theta)$ is in our doubly robust class,
Equation (4.1) holds so V'k;'^/ is rate minimax (or near minimax if a is chosen posi- 
tive), and the suppositions of Theorem 6.1 hold for (r) = "0^;^/ (■'')j Theorem 6.1 
then implies the width of the interval estimator for r {9) based on V'k;^/ (t) converges 
to zero at the convergence rate of tp^^^ (r) to -0 (■'': ^) • 



Appendix 



In the following, we assume all parametric submodels are sufficiently smooth and 
regular that expectation and differentiation operators commute as needed. We also 
define $\mathbb{IF}_{1,1}$ to be $\mathbb{IF}_1$.

Proof of Theorem 2.2. Define the bias function Bra[9\9] of W^{9) to be 
Egi [IF,„ {9)]. Define 



where we reserve $*$ for differentiation with respect to the first argument of $B_m[\cdot,\cdot]$. Thus for $s\le m$,

$$\psi_{\backslash l_1\ldots l_s}(\theta)=B_{m,l_1^*\ldots l_s^*}[\theta,\theta].$$

To prove the theorem we will first need to show that:

(A.1)  $B_{m,l_1^*\ldots l_j^*\,l_{j+1}\ldots l_s}[\theta,\theta]=0$ for $m\ge s>j>0$.

To this end note that for $j<m$,



= [9,9]+B^,-^..,u,^, [9,9] 
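where the second equality is by the definition of $\mathbb{IF}_m(\theta)$, the third is by the chain rule, and the fourth is again by the definition of $\mathbb{IF}_m(\theta)$. Hence $B_{m,l_1^*\ldots l_j^*\,l_{j+1}}[\theta,\theta]=0$. Hence for $j\le m-2$,

$$\begin{aligned}
0&=\partial B_{m,l_1^*\ldots l_j^*\,l_{j+1}}[\theta,\theta]/\partial\zeta_{l_{j+2}}\\
&=B_{m,l_1^*\ldots l_j^*\,l_{j+2}^*\,l_{j+1}}[\theta,\theta]+B_{m,l_1^*\ldots l_j^*\,l_{j+1}l_{j+2}}[\theta,\theta]\\
&=0+B_{m,l_1^*\ldots l_j^*\,l_{j+1}l_{j+2}}[\theta,\theta],
\end{aligned}$$

where the last equality holds because we just proved $B_{m,l_1^*\ldots l_j^*\,l_{j+1}}[\theta,\theta]=0$ for arbitrary indices. Iterating this argument proves (A.1). We complete the proof by induction on $s$ for some $s\le m$. Given an $s=1$ dimensional regular parametric submodel $\tilde\theta(\zeta)$, $E_{\tilde\theta(\zeta)}[\mathbb{IF}_m(\tilde\theta(\zeta))]=0$ by assumption. Hence, by regularity of the model, $0=B_{m,l_1^*}[\theta,\theta]+B_{m,l_1}[\theta,\theta]$. Therefore $B_{m,l_1}[\theta,\theta]=-\psi_{\backslash l_1}(\theta)$. Now suppose the theorem is true for $s$. Then

$$\begin{aligned}
-\psi_{\backslash l_1\ldots l_{s+1}}(\theta)&=-\partial\psi_{\backslash l_1\ldots l_s}(\theta)/\partial\zeta_{l_{s+1}}\\
&=\partial B_{m,l_1\ldots l_s}[\theta,\theta]/\partial\zeta_{l_{s+1}}\\
&=B_{m,l_{s+1}^*\,l_1\ldots l_s}[\theta,\theta]+B_{m,l_1\ldots l_{s+1}}[\theta,\theta]\\
&=0+B_{m,l_1\ldots l_{s+1}}[\theta,\theta],
\end{aligned}$$

where the second equality is by the induction assumption, the third by the chain rule, and the last by Equation (A.1). □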
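
As a hedged illustration of the bias property established above, consider the toy functional $\psi(\theta)=\mu^2$ with $\mu=E_\theta[O]$. The example, the fixed initial value `mu_hat`, and the helper names below are assumptions introduced here and are not part of the proof. With first order kernel $2\hat\mu(O-\hat\mu)$ and second order kernel $(O_i-\hat\mu)(O_j-\hat\mu)$ for independent $O_i$, $O_j$, the means under $\theta$ are available in closed form, so the bias of the first and second order corrected plug-in values can be computed exactly.

```python
from fractions import Fraction


def psi(mu):
    """Toy functional psi = mu ** 2."""
    return mu * mu


def mean_if1(mu_hat, mu):
    """E_theta[2 * mu_hat * (O - mu_hat)] when E_theta[O] = mu."""
    return 2 * mu_hat * (mu - mu_hat)


def mean_if22(mu_hat, mu):
    """E_theta[(O_i - mu_hat) * (O_j - mu_hat)] for independent O_i, O_j."""
    return (mu - mu_hat) ** 2


def bias_first_order(mu_hat, mu):
    """Bias of psi(mu_hat) plus the first order correction."""
    return psi(mu_hat) + mean_if1(mu_hat, mu) - psi(mu)


def bias_second_order(mu_hat, mu):
    """Bias after also adding the second order correction."""
    return bias_first_order(mu_hat, mu) + mean_if22(mu_hat, mu)


mu = Fraction(3, 10)       # true mean (assumed)
mu_hat = Fraction(2, 5)    # fixed initial estimate (assumed)
b1 = bias_first_order(mu_hat, mu)
b2 = bias_second_order(mu_hat, mu)
```

Here the first order bias equals $-(\hat\mu-\mu)^2$, which is second order in the initial error, and the second order bias vanishes because the functional is quadratic.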

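Proof of Theorem 2.3. (1) Consider two influence functions $\mathbb{IF}_m^{(1)}(\theta)$ and $\mathbb{IF}_m^{(2)}(\theta)$ for $\psi(\theta)$. Then $E_\theta[\{\mathbb{IF}_m^{(1)}(\theta)-\mathbb{IF}_m^{(2)}(\theta)\}\,\mathbb{S}_{s,\bar l_s}(\theta)]=0$ for any score $\mathbb{S}_{s,\bar l_s}(\theta)$, $s\le m$, and hence for any linear combination of scores. But, by definition, linear combinations of scores are dense in $\Gamma_m(\theta)$. Thus $\mathbb{IF}_m^{(1)}(\theta)$ and $\mathbb{IF}_m^{(2)}(\theta)$ have the same projection on $\Gamma_m(\theta)$. (2-3): Essentially immediate from the definitions. (4): For $t<s$,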



i,^j^{9)^Eg W„,{9)S^j^{9) 



Eg 



lim.e [IF™ (9) \Ut {9)] {9) 



for any §^^^(6'). (5. a): follows from (1). (5.b): follows from (4). Degeneracy of 

$\mathbb{IF}_{m,m}(\theta)$ follows at once from the fact that $\mathbb{IF}_{m,m}(\theta)\in\mathcal{U}_{m-1}(\theta)^{\perp}$ in $\mathcal{U}_m(\theta)$. Proof
of part (5.c) requires the following. □

Lemma A.l. Suppose, for m > 1, IF„i,,„ (9) and if^ .^^ym^^^ . {Oi^_^^;9^ 

exist w.p.l for a kernel IF^.m (9). Let f (^O; 9 (C)^ , (^"^ — (Ci, . . . , ^s), denote an 

arbitrary smooth s-dimensional parametric submodel. Let If G {l,2,...,s}, and 
Si^ (O) be the score for evaluated at 9. Then, 

(i) ~*/i,,;/-!'™(o,i,...,o,„;-) {Ot^+i',0)sh (Oi^+i), -ifm,m,\lt {Oil, ■ ■ ■ ,0i^;9), 
and ifm,m {Oi-^ , ■ ■ ■ , Oi^ ; 9) si^ {Oi^ ) each have the same mean given Oi-^ , ■ ■ ■ , Oi^_-^ , 



(ii) E [ifm,ni.\it (On, 



(iii) Eg «/i^j/^«-(Oii,...,Oi,„;-) + l<^»i'---''^»™-2'C'«- 



0, so 



H 
(iv) $\mathbb{IF}_{m,m,\backslash l_t}(\theta)$ satisfies $\Pi_\theta[\mathbb{IF}_{m,m,\backslash l_t}(\theta)\,|\,\mathcal{U}_{m-2}(\theta)]=0$ and



Tie [I^m,m,\l, (e) Un-1 {0)] = 

Proof, (i) By IF,n.m (0) degenerate, 



mEa 



jpsyra _ Q,^ 

m,m,\lt ,»m ^ ' ' '- m L 



" ra^ni^Lt.i-n 



IF 



(0) su (0.„J|0.,,...,0,_, 



Further, by definition, 



Efi 



Ee 



IF„ 
(ii) By IF^.,n (0) degenerate = Ee (0) si, {O^J \0^„ . . . ,
w.p.1 and so (ii) follows from (i).

(iii) (i) and (ii) imply 



= Ee 
= Ee 



Ee if, 

X Sit {0^^+,) \0^,,...,0^^_^] . 

But, by (Oi^^j) an arbitrary mean zero function, 

[ifi.if^^[Oi-^,...,Oi^i-) . . . ,Oi,„_2| 

= {'^fi,if^_Z{o,-^....,o^^i-) (Oj^+ii^*) . . . , Oi„_2| = 0. 
(iv) By definition, 

Ue [lF„,™,v, (0) \U^., {9)] = V [{/ - d,„,4 {/i^™,„,v,,7„ {0)}' ■ 

The result follows by Equation (2.1) and part (ii). □ 
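Proof of Theorem 5(c)(ii). Consider an $m$-dimensional parametric submodel

$$f(O;\tilde\theta(\zeta))=f(O;\theta)\Bigl\{1+\sum_{l=1}^{m}\zeta_l a_l(O)\Bigr\},\quad\zeta^{T}=(\zeta_1,\ldots,\zeta_m),$$

with $E_\theta[a_l(O)]=0$. Since this model is linear in the $\zeta_l$, $f_{\backslash l_1\ldots l_m}(O_j;\theta)=0$ for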
m > 1. Hence S^j {9) is degenerate of order m, i.e., E>^j {9) e (6*). Since 

IFm-i (0) exists, on setting Is = s for s = 1, . . . , m. 



(O) / n ^C,lc=o ^ (^) = [lF„,_i {9) (0)' 
Differentiating the last display with respect to $\zeta_m$ and evaluating at $\zeta=0$, we
obtain



Now 



Eq 



E, 



IF™-i,V„.(^)§™-i,I,„_,(^) 



IF,, 



7n — 1 



Setting siAO,^,e) = a.(0,J, § „_ij„_^ (f?) = T.^,^■■■^^^_, D "^(O^.;^) is de- 

' r— 1 

generate of order m — 1 so 



m — 1 m— 1 



m— 1 



and 



= 0. Hence 



^^^^ (0) = (m - f)!i?e (^*C-";,„,_i,\i^ (o.,, . . . , 0.„_,; n (O^; ^) j ■ 

Now, by the assumed existence of IF™ (0), we also have {6) = iJg [IF™ {&) x 

§™(0)] = m!£;(, (^i/^^;KO«i,...,O«,„;0) n arCO*.;^)^ It follows that, for any 
choice of 7Ti — 1 mean zero functions a,. (O) under 0, 

U -TnEe [if^yj^{o,,,...,o,^;e)am{o,^;e)\o,„...,o,^_,] 

m-l \ 
r=l / 

= Ee (O,, , . . . , 0.„_, ; 9) J| a, (O,^ ; 0)^ , 
r(O,,,...,O,_,;0) 



where 



rn— 1,1 



The last equality follows from 



'■Jm-l.m-l.\l,„ V"-^'i ' ■ ■ ■ ' "^'^-l ' 
orthogonal to $\prod_{r=1}^{m-1}a_r(O_{i_r};\theta)$. We conclude $r(O_{i_1},\ldots,O_{i_{m-1}};\theta)=0$ with probability 1 because $r(O_{i_1},\ldots,O_{i_{m-1}};\theta)$ is a degenerate U-statistic kernel of order $m-1$ and all degenerate U-statistics of order $m-1$ have kernels that are the (possibly infinite) sum of products of $m-1$ mean zero functions. It follows that, on a set $\mathcal{O}_{m-1}$ which has probability 1 under $F(\cdot;\theta)$,

= Eg [{m X tf^y;Z (o,,, . . . , o,„_, , 0,„,; 0) a™ (0,„ ; 0)}] 
+ {/ - d,„^i,e} "L„_i,y,„ (o,i , . . . , o,„._, ; 



Em— 1 - psym / ^ 



since, by parts (i) and (ii) of the Lemma A.l and Equation (2.1), 



m — 1 

, • ■ • > Oi^-i , O, o,^.^, , . . . , o,;,„_i ; O) (O; 6') 



Here $I$ is the identity operator. Now since the model $f(O;\tilde\theta(\zeta))=f(O;\theta)\{1+\zeta_m a_m(O)\}$ with $\zeta_s=0$ for $s<m$ has score $a_m(O)$ and such scores are dense in the subspace of $L_2(F(\cdot;\theta))$ with mean zero, it follows that $if_{m-1,m-1}(o_{i_1},\ldots,o_{i_{m-1}};\theta)$
has influence function

X i/;„^;^ (o,, , . . . , o,;„_i , O; 6*) 

m— 1 

- X! */m-lm-l > • ■ • . Oj,-l : O, O,^^^ , . . . , Oj„_, ; 6*) 

J = l 



on the set Om-i- Thus 



*/™"l,,„-l(°n---.Oi™-i;-) 



(O.„;0) 



□ 



Corollary A. 2. For m>2, 

(A.2) He (9) |W^_2 (^)] = -He [lF,„,,„,v, (0) |Zi,„_i (9)] 

(A.3) IF^,v, (f?) = He [W,^.,„,,\i, (9) \U^_, (9)] 



Eg 



(A.4) 



m^-Ee ^CZm^^. (O^^ ' • ■ • ' O''^ ; ^) n Si. (O. 
Proof of Equation (A.2). By Lemma A.1 and Theorem 5c(ii),
He [lF„,™,v* (^) (^)] 

= V [mEe {lF22,j^^ (0) si, {O.J |0.,, . . . , 0.„._,)" 

= V mE0(^m~^drafi\ifi^^f^vrr^^ _^{Oi^,...,Oi _^;.) (C*'™! ^)} (^i™) I 
xO,i,...,Oi„_i)] . 
Now, by part (iii) of Lemma A.l and Equation (2.1), the RHS is 



Et 



[e [(m - 1)£; „_^(o,,...,o.„_,,) On, ■ • ■ , o,„_, 

xs;, (0.„)|0,,,...,0,„_,]} 

On the other hand, by part (iv) of the Lemma A.l, 
Ue [lF„_i,„_i,v, {9) \Ui_., {9)] 
= V [0)] 



V 



- V 



m — l,m — l,\/t,iTTi-i ^ ' ' ^ " 



(™ - 1) Eg 

Proof of Equation (A.3). Write 
IF„,v^ {9) = He [lF,„,„,^v^ (0) (0)] + {n [IF2,2,\/. {0) \Ui {9)] +IFi,\,^ (0)} 

7n— 1 

+ 5] {n [lF,+i,,+i,\;, (0) \U, {9)] + n [IF,,- {9) \U^_, {9)] ] 



□ 



J=2 



The RHS is He [IF„,„,\,^ (0) (0)] by Equation (A.2). 

Proof of Equation (A. 4). 



□ 



Ea 



E, 



n [lIF„,„,v,„^, (0) |Z^„Vi {9)] S^ j^^ (9) 

by Equation (A.3). But the RHS of this equation is the RHS of Equation (A.4). □

Proof of Theorem 5c(i). By assumption



[9) = Ee [Wra-l (^)S^_ij„_, {0) 

Hence 

^^j^ (0) ^ Ee (lF™_i (e)§„j„^ (0)) [lF,„_iAu (^)§™-iJ„_i (^) 

By Equation (A.4), and the assumption $if_{m-1,m-1}(O_{i_1},\ldots,O_{i_{m-1}};\theta)$ has an influence function, we obtain



Ee 



F„-i,\/„ i9)S^_^j_^{9)_ 
{m-l)\Ee ^_^(o.,,...,o.„_,,) (0»„;^)^u (O.„;0) H (0^ 



m— 1 



r=l 
We conclude that $\mathbb{IF}_{m,m}$ exists and equals

Proof of Theorem 3.14. By Theorem 3.13,
~ . (9) = if^ ~ (O,, ; 9) 

(b{9),p{9))-^k.{e) 

= h{o,,,h{X,,-9),p{X,,-9)^-ijk{9) 
and by part 5.c of Theorem 2.3, 



V 









IF ~ - 




¥ 









IF. 



(0) 



\ut'^^ {9) 



Now 



IF, 



M/ ~ (o.„ ).. = h~(o,,,h{X,,-9),p{X,,-9)\ IF^~, . (9) 



where 



/i- (O,, , b {X,, ■ 9) ,p{X,, ; 0)) = i/i.,, & (X,, ; 9) + iJg.^, ■ 



^u^«, { 



B,,Zk,IF~ ,^ . (61) 

1 l,'7fc(-).'2 ^ ^ 



and 



{9)^-hX,,[Ee 



PBH^ZkZ^ 



PBH.ZkZl 



} ' [{H,b{X;9) + H3'jpZk 



PBH.ZkZ^ 



}' 



^{H,piX;9) + H2},^ B,,zl^ [e, 
{Hib{X;9) + H3}PZk\^ 

[H^biX; 9) + [PBJ?i:^fe^r] 
{Hip{X;9) + H2}BZk 



and further 



V 



IF, 



(^) 



1^1 {0) 



= 0, 
Since



Ef, 



{H,p{X; 9) + H2} Bzi}^ = Ee [{i/16 {X; 6) + //g} PZ 



= 



and thus IF^ _ /q . \ (9) is degenerate. Because IF^ -j _ /q . \ (9) has two 

terms, it appears that IF^^ ~ j will consist of two terms. However by the symmetry 
upon interchange of ^2 and ii, and the permutation invariance of the operator V 



V 



IF, 



= V 



-2{H^p{X,9)+H2).,^Kzl, [Ee 
^ \[H^h{X,t 



Thus we can take 



IF^ 



-{H,p{X,9)+H2},J^,zl^[Ee 
Zk{Hib{X, 9) + Hs\p 



r 



PBHiZkZl 



r 



as was to be proved. We now complete the proof of the theorem by induction. We assume it is true for $IF_{m,m,\tilde\psi_k,\bar i_m}$ and prove it is true for $IF_{(m+1),(m+1),\tilde\psi_k,\bar i_{m+1}}$.



Now 



1. 

m 



He 



IF, 



i.if ~ (O- 

Now by the induction hypothesis, 



{9) 1^^^"""+^ {9) 



if 7 [O- ,t 

•'m,m,ipk V 

= (-1)™-! \(HiPi9) + H,) BZI 



\{{Ee 



. . _ _Tn-i [PBH.ZkZ^ 
PBHiZkZi^\j I ^ r^A„^ 



PBHiZkZ^ 
[Ee [pBHiZkZl] pk [h,B (9) + H3) P 



The derivatives with respect to the 9's in P (9) ,B (9) and in the m — 1 terms 

-1 



Ea 



PBH.ZkZ 



}' 



will each contribute a term to V 



However differentiating with respect to the 9 in the m — 2 terms E, 



PBHiZkZl 



will not contribute to V 



as the contribution from each 



of these to — 2 terms to IF, 



\ . ,. / „\ . (9) IS only a function of to units 
~ (0- ^ ' 

data and is thus an clement of Um [9) which is orthogonal to the space Um"''"*^ (9) 
that is projected on. Now
IF 



,{Ee 



{0) 



PBHiZkZ, 



PBHiZkZl 

" im+l 

-E„ PBHiZkZl 



Eh 



PBHiZkZ 



so upon permuting the unit indices, the contribution of each of these $m-1$ terms



to /F 



o- ,e)A„ 



(A.5) 



-i-iy 

m+l 



(0) is 



HiP {0) + H2 ] BZI 



PBHiZkZl 



[PBHiZkZ^ 

e/ ' ' 



PBHiZkZl 



Y{{Ee 

s=3 

[Eg [pBH^ZkZl] [Zfe (i/iB (0) + F3) P 



which is already degenerate (i.e., orthogonal to $\mathcal{U}_m(\theta)$). Differentiating with respect
to the 0's of P{0),B (0) in IF^ _ ^ (6*) we obtain 



s=3 

+ (-1) 



PBi/iZfeZfe ] } ' I (pBHiZkZl 
[Zfc [HiB{0) + H^^ P 











PBHiZkZl 


}] 



PBH^ZkZ^ 

771 — 1 



HiP{9)+H2) Bzl 



s=3 

\ZkHiP 



PBHiZkZ^ 



PBHiZuZ^ 











PBHiZkZl 


}] 



IF^ 



l,p(Xi2,-),i„+i 



Substituting in the above expressions for IF_^~^-^ ^ (0) and IF-^^ —^-^ ^ 

then projecting on Um"'"^^^ (0), and again permuting unit indices, we obtain two 
identical terms both equal to Equation (A.5). Thus we obtain $m+1$ identical terms in all. Upon dividing by $m+1$, we conclude that $\mathbb{V}[IF_{(m+1),(m+1),\tilde\psi_k,\bar i_{m+1}}(\theta)]$ equals $\mathbb{V}$ operating on (A.5), proving the theorem. □